Kai Chen

Stata command to convert string GVKEY to numerical GVKEY or vice versa

Posted on July 23, 2018 by Kai Chen

The default type of GVEKY in Compustat is string. Sometimes, we need it to be a numerical type in Stata (e.g., when we want to use the super handy command tsset). The command to convert string GVKEY to numerical GVEKY is very simple:

destring gvkey, replace

The command to revert numerical GVKEY to string GVKEY with leading zeros is as follows:

tostring gvkey, replace format(%06.0f)

Posted in Stata | 1 Comment

Stata command to calculate the area under ROC curve

Posted on July 19, 2018 by Kai Chen

If we want to evaluate the predictive ability of a logit or probit model, Kim and Skinner (2012, JAE, Measuring securities litigation risk) suggest that

A better way of comparing the predictive ability of different models is to use the Receiver Operating Characteristic, or ROC curve (e.g., Hosmer and Lemeshow, 2000, Chapter 5). This curve ‘‘plots the probability of detecting a true signal (sensitivity) and false signal (1—speciﬁcity) for the entire range of possible cutpoints’’ (p. 160, our emphasis). The area under the ROC curve (denoted AUC) provides a measure of the model’s ability to discriminate. A value of 0.5 indicates no ability to discriminate (might as well toss a coin) while a value of 1 indicates perfect ability to discriminate, so the effective range of AUC is from 0.5 to 1.0. Hosmer-Lemeshow (2000, p. 162) indicate that AUC of 0.5 indicates no discrimination, AUC of between 0.7 and 0.8 indicates acceptable discrimination, AUC of between 0.8 and 0.9 indicates excellent discrimination, and AUC greater than 0.9 is considered outstanding discrimination.

The Stata command to report AUC is as follows:

logit y x1 x2 or probit y x1 x2

lroc, nograph

The most recent edition of the book Kim and Skinner refer to is Hosmer, D. W., Jr., S. A. Lemeshow, and R. X. Sturdivant. 2013. Applied Logistic Regression. 3rd ed. Hoboken, NJ: Wiley.

A technical note from Stata: lroc requires that the current estimation results be from logistic, logit, probit, or ivprobit.

A side question: what’s the difference between logistic and logit regression? Nick Cox’s short answer is: “same thing with different emphases in reporting.” (something like one gives you the odds ratios, the other gives you the log of the odds ratios.)—thanks to a post on Stack Overflow.

Posted in Stata | 1 Comment

Stata commands to calculate skewness

Posted on July 16, 2018 by Kai Chen

Suppose we are going to calculate the skewness of 12 monthly returns. The 12 returns may be stored in a row (Figure 1) or in a column (Figure 2). This post discusses how to calculate the skewness in these two situations. Please note there are several formulae for skewness out there, which may yield different results. This post uses the formula that yields the same skewness as the Stata command sum var, detail reports.

Figure 1: Returns are stored in a row

Figure 2: Returns are stored in a column

If returns are stored in a row

Stata does not provide a command to calculate the skewness in this situation. The following Stata commands will do the job.

egen ret_mean=rowmean(ret1-ret12)
egen n=rownonmiss(ret1-ret12)

foreach v in ret1 ret2 ret3 ret4 ret5 ret6 ret7 ret8 ret9 ret10 ret11 ret12 {
  ge `v'_m3=(`v'-ret_mean)^3
}
egen m3=rowtotal(ret1_m3-ret12_m3), missing
replace m3=m3/n

foreach v in ret1 ret2 ret3 ret4 ret5 ret6 ret7 ret8 ret9 ret10 ret11 ret12 {
  ge `v'_m2=(`v'-ret_mean)^2
}
egen m2=rowtotal(ret1_m2-ret12_m2), missing
replace m2=m2/n

ge ret_skew=m3*m2^(-3/2)

egen ret_mean=rowmean(ret1-ret12)

egen n=rownonmiss(ret1-ret12)

foreach v in ret1 ret2 ret3 ret4 ret5 ret6 ret7 ret8 ret9 ret10 ret11 ret12 {

ge `v'_m3=(`v'-ret_mean)^3

}

egen m3=rowtotal(ret1_m3-ret12_m3), missing

replace m3=m3/n

foreach v in ret1 ret2 ret3 ret4 ret5 ret6 ret7 ret8 ret9 ret10 ret11 ret12 {

ge `v'_m2=(`v'-ret_mean)^2

}

egen m2=rowtotal(ret1_m2-ret12_m2), missing

replace m2=m2/n

ge ret_skew=m3*m2^(-3/2)

If returns are stored in a column

Stata provides a command to calculate skewness in this situation (egen and skewness). However, the computation is extremely slow if we have millions of observations. I would suggest calculating the skewness manually as follows:

sort permno (add more variables here to identify a group)

by permno: egen ret_mean=mean(ret)
by permno: egen n=count(ret)

ge ret_m3=(ret-ret_mean)^3
by permno: egen m3=total(ret_m3)
replace m3=m3/n

ge ret_m2=(ret-ret_mean)^2
by permno: egen m2=total(ret_m2)
replace m2=m2/n

ge skewness=m3*m2^(-3/2)

sort permno (add more variables here to identify a group)

by permno: egen ret_mean=mean(ret)

by permno: egen n=count(ret)

ge ret_m3=(ret-ret_mean)^3

by permno: egen m3=total(ret_m3)

replace m3=m3/n

ge ret_m2=(ret-ret_mean)^2

by permno: egen m2=total(ret_m2)

replace m2=m2/n

ge skewness=m3*m2^(-3/2)

Posted in Stata | Leave a comment

Lawsuit data from Stanford Law School’s Securities Class Action Clearinghouse

Posted on July 15, 2018 by Kai Chen

The Python script in the original post has been removed as its use violates the Terms of Service of the data provider.

Stanford Law School’s Securities Class Action Clearinghouse is always happy to share the data (subject to a Non-Disclosure Agreement) with academic researchers for non-commercial research or analysis. If you have any data needs, please contact their SCAC Content Manager at scac@law.stanford.edu.

Posted in Python | Tagged Python | 14 Comments

Calculate idiosyncratic stock return volatility

Posted on December 6, 2017 by Kai Chen

I have noted two slightly different definitions of idiosyncratic stock return volatility in:

Campbell, J. Y. and Taksler, G. B. (2003), Equity Volatility and Corporate Bond Yields. The Journal of Finance, 58: 2321–2350. doi:10.1046/j.1540-6261.2003.00607.x
Rajgopal, S. and Venkatachalam, M. (2011), Financial reporting quality and idiosyncratic return volatility. Journal of Accounting and Economics, 51: 1–20. doi.org/10.1016/j.jacceco.2010.06.001.

The code in this post is used to calculate Campbell and Taksler’s (2003) idiosyncratic stock return volatility, but it can be easily modified for other definitions.

Specifically, this code requires an input dataset that includes two variables: permno and enddt, where enddt is the date of interest. This code will calculate the standard deviation of daily abnormal returns over the 180 calendar days before (and including) enddt. Abnormal returns will be calculated using four methods: (1) market-adjusted; (2) standard market model; (3) Fama-French three factors; and (4) Fama-French three factors as well as momentum. This code requires at least 21 return observations (one-month trading days) over that 180-day period for a permno to calculate its stock return volatility.

libname local "D:\Dropbox";

* Remote signon and upload input database including permno and date;
%let wrds=wrds-cloud.wharton.upenn.edu 4016;
options comamid=TCP remote=WRDS;
signon username=_prompt_;

rsubmit;

proc upload data=local.input out=input; run;

* Get raw returns and FF risk factors;
proc sql;
  create table rets as
  select a.*, b.date, b.ret, c.smb, c.hml, c.umd, d.vwretd as mktret, (b.ret-d.vwretd) as exret
  from input a left join crsp.dsf b
  on a.permno=b.permno and a.enddt-180<b.date<=a.enddt
  left join ff.factors_daily c
  on b.date=c.date
  left join crsp.dsi d
  on b.date=d.date
  order by a.permno, a.enddt, b.date;
quit;

* Estimate factor exposures;
proc printto log=junk; run;

proc reg data=rets edf outest=params noprint;
  by permno enddt;
  eq0: model exret=;
  eq1: model ret=mktret;
  eq2: model ret=mktret smb hml;
  eq3: model ret=mktret smb hml umd;
run;
proc printto; run;

* Compute abnormal returns for all models for each trading day;
data abrets; merge rets (in=a)
  params (where=(_model_='eq0')
     keep=permno enddt _model_ _rmse_ _p_ _edf_
     rename=(_rmse_=std0 _p_=p0 _edf_=edf0))
  params (where=(_model_='eq1')
     keep=permno enddt _model_ _rmse_ intercept mktret
     rename=(_rmse_=std1 intercept=alpha1 mktret=beta1))
  params (where=(_model_='eq2')
     keep=permno enddt _model_ _rmse_ intercept mktret smb hml
     rename=(_rmse_=std2 intercept=alpha2 mktret=beta2 smb=sminb2 hml=hminl2))
  params (where=(_model_='eq3')
     keep=permno enddt _model_ _rmse_ intercept mktret smb hml umd
     rename=(_rmse_=std3 intercept=alpha3 mktret=beta3 smb=sminb3 hml=hminl3 umd=umind3));
  by permno enddt;
  var0=std0**2; var1=std1**2;var2=std2**2;var3=std3**2;
  abret0=exret;
  expret1=alpha1+beta1*mktret; abret1=ret-expret1;
  expret2=alpha2+beta2*mktret+sminb2*smb+hminl2*hml; abret2=ret-expret2;
  expret3=alpha3+beta3*mktret+sminb3*smb+hminl3*hml+umind3*umd; abret3=ret-expret3;
  nobs=p0+edf0;  /*number of observations used in estimation*/
 drop p0 edf0 std0 std1 std2 std3 _model_ exret;
 if a and nobs>=21;
run;

proc sort data=abrets; by permno enddt date; run;

proc means data=abrets noprint;
  by permno enddt;
  output out=retvol
  std(abret0)=vol0 std(abret1)=vol1 std(abret2)=vol2 std(abret3)=vol3;
run;

* Download output dataset and remote signoff;
proc download data=retvol out=local.retvol; run;

endrsubmit;
signoff;

libname local "D:\Dropbox";

* Remote signon and upload input database including permno and date;

%let wrds=wrds-cloud.wharton.upenn.edu 4016;

options comamid=TCP remote=WRDS;

signon username=_prompt_;

rsubmit;

proc upload data=local.input out=input; run;

* Get raw returns and FF risk factors;

proc sql;

create table rets as

select a.*, b.date, b.ret, c.smb, c.hml, c.umd, d.vwretd as mktret, (b.ret-d.vwretd) as exret

from input a left join crsp.dsf b

on a.permno=b.permno and a.enddt-180<b.date<=a.enddt

left join ff.factors_daily c

on b.date=c.date

left join crsp.dsi d

on b.date=d.date

order by a.permno, a.enddt, b.date;

quit;

* Estimate factor exposures;

proc printto log=junk; run;

proc reg data=rets edf outest=params noprint;

by permno enddt;

eq0: model exret=;

eq1: model ret=mktret;

eq2: model ret=mktret smb hml;

eq3: model ret=mktret smb hml umd;

run;

proc printto; run;

* Compute abnormal returns for all models for each trading day;

data abrets; merge rets (in=a)

params (where=(_model_='eq0')

keep=permno enddt _model_ _rmse_ _p_ _edf_

rename=(_rmse_=std0 _p_=p0 _edf_=edf0))

params (where=(_model_='eq1')

keep=permno enddt _model_ _rmse_ intercept mktret

rename=(_rmse_=std1 intercept=alpha1 mktret=beta1))

params (where=(_model_='eq2')

keep=permno enddt _model_ _rmse_ intercept mktret smb hml

rename=(_rmse_=std2 intercept=alpha2 mktret=beta2 smb=sminb2 hml=hminl2))

params (where=(_model_='eq3')

keep=permno enddt _model_ _rmse_ intercept mktret smb hml umd

rename=(_rmse_=std3 intercept=alpha3 mktret=beta3 smb=sminb3 hml=hminl3 umd=umind3));

by permno enddt;

var0=std0**2; var1=std1**2;var2=std2**2;var3=std3**2;

abret0=exret;

expret1=alpha1+beta1*mktret; abret1=ret-expret1;

expret2=alpha2+beta2*mktret+sminb2*smb+hminl2*hml; abret2=ret-expret2;

expret3=alpha3+beta3*mktret+sminb3*smb+hminl3*hml+umind3*umd; abret3=ret-expret3;

nobs=p0+edf0; /*number of observations used in estimation*/

drop p0 edf0 std0 std1 std2 std3 _model_ exret;

if a and nobs>=21;

run;

proc sort data=abrets; by permno enddt date; run;

proc means data=abrets noprint;

by permno enddt;

output out=retvol

std(abret0)=vol0 std(abret1)=vol1 std(abret2)=vol2 std(abret3)=vol3;

run;

* Download output dataset and remote signoff;

proc download data=retvol out=local.retvol; run;

endrsubmit;

signoff;

Posted in SAS | 8 Comments

Commonly used Stata commands to deal with potential outliers

Posted on November 10, 2017 by Kai Chen

In accounting archival research, we often take it for granted that we must do something to deal with potential outliers before we run a regression. The commonly used methods are: truncate, winsorize, studentized residuals, and Cook’s distance. I discuss in this post which Stata command to use to implement these four methods.

First of all, why and how we deal with potential outliers is perhaps one of the messiest issues that accounting researchers will encounter, because no one ever gives a definitive and satisfactory answer. In my opinion, only outliers resulting from apparent data errors should be deleted from the sample. That said, this post is not going to answer that messy question; instead, the purpose of this post is to summarize the Stata commands for commonly used methods of dealing with outliers (even if we are not sure whether these methods are appropriate—we all know that is true in accounting research!). Let’s start.

Truncate and winsorize

In my opinion, the best Stata commands to do truncate and winsorize are truncateJ and winsorizeJ written by Judson Caskey. I will save time to explain why, but simply highly recommend his work. Please see his website here.

To install these two user-written commands, you can type:

net from https://sites.google.com/site/judsoncaskey/data
net install utilities.pkg

After the installation, you can type help truncateJ or help winsorizeJ to learn how to use these two commands.

Studentized residuals

The first step is to run a regression without specifying any vce parameter in Stata (i.e., not using robust or clustered error terms). Suppose the dependent variable is y, and independent variables are x1 and x2. The first step should look like this:

regress y x1 x2

Then, use the predict command:

predict rstu if e(sample), rstudent

If the absolute value of rstu exceed certain critical values, the data point will be considered as an outlier and be deleted from the final sample. Stata’s manual indicates that “studentized residuals can be interpreted as the t statistic for testing the significance of a dummy variable equal to 1 in the observation in question and 0 elsewhere. Such a dummy variable would effectively absorb the observation and so remove its influence in determining the other coefficients in the model.” To be honest, I do not fully understand this explanation, but since rstu is a t statistics, the critical value for a traditional significance level should be applied, for example, 1.96 (or 2) for 5% significance level. That’s why in literature we often see that data points with absolute values of studentized residuals greater than 2 will be deleted. Some papers use the critical value of 3, which corresponds to 0.27% significance level, and seems to me not very reasonable.

Now use the following command to drop “outliers” based on the critical value of 2:

drop if abs(rstu) > 2

The last step is to re-run the regression, but this time we can add appropriate vce parameters to address additional issues such as heteroskedasticity:

regress y x1 x2, vce(robust), or

regress y x1 x2, vce(cl gvkey)

Cook’s distance

This method is similar to studentized residuals. We predict a specific residual, namely Cook’s distance, and then delete any data points with Cook’s distance greater than 4/N (Cook’s distance is always positive).

regress y x1 x2

predict cooksd if e(sample), cooksd

drop if cooksd > critical value

Next, re-run the regression with appropriate vce parameters:

regress y x1 x2, vce(robust), or

regress y x1 x2, vce(cl gvkey)

Lastly, I thank the authors of the following articles which I benefit from:

https://www3.nd.edu/~rwilliam/stats2/l24.pdf

https://www.stat-d.si/mz/mz16/coend16.pdf

A more formal and complete econometrics book is Belsley, D. A., E. Kuh, and R. E. Welsch. 1980. Regression Diagnostics: Identifying Influential Data and Sources of Collinearity. New York: Wiley.

Posted in Stata | 5 Comments

Use Python to extract URLs to HTML-format SEC filings on EDGAR

Posted on October 25, 2017 by Kai Chen

[Update on 2020-06-26] Eduardo has made a significant improvement to the code. Now you can specify a starting date and download the index file during the period from that starting date to the most recent date. I expect it to be very useful for many readers of my website. Eduardo has kindly shared the code in the comment. Thank you, Eduardo!

[Update on 2019-08-07] From time to time, some readers informed that the first-part code seemingly stopped at certain quarters. I don’t know the exact reason (perhaps a server-side issue). I never encountered the issue. I would suggest that you just try again later. I also share a Dropbox link from which you can download the first-part results (as of 2019-08-07; 2.4GB) in the CSV format (link). Please note—as I have explained in my original post, the URL contained in the downloadable CSV is not the URL to the HTML-format filing; it is just the URL to an index page. You need to select your sample and go through the second-part code to get the URL to the HTML-format filing.

[Original post] I wrote two posts to describe how to download TXT-format SEC filings on EDGAR:

Although TXT-format files have benefits of easy further handling, they are oftentimes not well formatted and thus hard to read. A HTML-format 10-K is more pleasing to eyes. Actually, SEC also provides the paths (namely, URLs) to HTML-format filings. With the path, we can open a HTML-format filing in a web browser, or further download the filing as a PDF.

There remain two parts in the Python code. In the first part, we need download the path data. Instead of using master.idx in the above two posts, we need use crawler.idx for this task. The path we get will be a URL like this:

https://www.sec.gov/Archives/edgar/data/859747/0001477932-16-007969-index.htm

Note that the path we get is a URL to an index page, not a URL to the HTML-format 10-Q in this example. To get the direct URL to the HTML-format 10-Q, we have to go one-level deeper. The second part of the Python code is used to go that deeper and extract the direct URL to the main body of the Form (the URL embedded in the first row in more than 99% cases). The code also extracts such information as filing date and period of report on the index page. The code writes the output (including filing date, period of report and direct URL) in log.csv. The following is an output example—the first URL is the path we get in the first part of the code; the second URL is the direct URL to the HTML-format Form.

13780110,5272,AMERICAN INTERNATIONAL GROUP INC,10-Q,05may2014,https://www.sec.gov/Archives/edgar/data/5272/0000005272-14-000007-index.htm,2017-10-11 23:44:42,2017-10-11 23:44:50,2014-05-05,2014-03-31,https://www.sec.gov/Archives/edgar/data/5272/000000527214000007/maindocument001.htm
16212215,5272,AMERICAN INTERNATIONAL GROUP INC,10-Q,03nov2016,https://www.sec.gov/Archives/edgar/data/5272/0000005272-16-000052-index.htm,2017-10-11 23:44:51,2017-10-11 23:44:58,2016-11-03,2016-09-30,https://www.sec.gov/Archives/edgar/data/5272/000000527216000052/maindocument001.htm
6772655,5272,AMERICAN INTERNATIONAL GROUP INC,10-Q,10may2007,https://www.sec.gov/Archives/edgar/data/5272/0000950123-07-007211-index.htm,2017-10-11 23:44:59,2017-10-11 23:45:05,2007-05-10,2007-03-31,https://www.sec.gov/Archives/edgar/data/5272/000095012307007211/y32085e10vq.htm
5671285,5272,AMERICAN INTERNATIONAL GROUP INC,10-Q,10may2006,https://www.sec.gov/Archives/edgar/data/5272/0000950123-06-006104-index.htm,2017-10-11 23:45:07,2017-10-11 23:45:14,2006-05-10,2006-03-31,https://www.sec.gov/Archives/edgar/data/5272/000095012306006104/y19465e10vq.htm
10831058,5272,AMERICAN INTERNATIONAL GROUP INC,10-Q,05may2011,https://www.sec.gov/Archives/edgar/data/5272/0001047469-11-004647-index.htm,2017-10-11 23:45:15,2017-10-11 23:45:20,2011-05-05,2011-03-31,https://www.sec.gov/Archives/edgar/data/5272/000104746911004647/a2203832z10-q.htm

13780110,5272,AMERICAN INTERNATIONAL GROUP INC,10-Q,05may2014,https://www.sec.gov/Archives/edgar/data/5272/0000005272-14-000007-index.htm,2017-10-11 23:44:42,2017-10-11 23:44:50,2014-05-05,2014-03-31,https://www.sec.gov/Archives/edgar/data/5272/000000527214000007/maindocument001.htm

16212215,5272,AMERICAN INTERNATIONAL GROUP INC,10-Q,03nov2016,https://www.sec.gov/Archives/edgar/data/5272/0000005272-16-000052-index.htm,2017-10-11 23:44:51,2017-10-11 23:44:58,2016-11-03,2016-09-30,https://www.sec.gov/Archives/edgar/data/5272/000000527216000052/maindocument001.htm

6772655,5272,AMERICAN INTERNATIONAL GROUP INC,10-Q,10may2007,https://www.sec.gov/Archives/edgar/data/5272/0000950123-07-007211-index.htm,2017-10-11 23:44:59,2017-10-11 23:45:05,2007-05-10,2007-03-31,https://www.sec.gov/Archives/edgar/data/5272/000095012307007211/y32085e10vq.htm

5671285,5272,AMERICAN INTERNATIONAL GROUP INC,10-Q,10may2006,https://www.sec.gov/Archives/edgar/data/5272/0000950123-06-006104-index.htm,2017-10-11 23:45:07,2017-10-11 23:45:14,2006-05-10,2006-03-31,https://www.sec.gov/Archives/edgar/data/5272/000095012306006104/y19465e10vq.htm

10831058,5272,AMERICAN INTERNATIONAL GROUP INC,10-Q,05may2011,https://www.sec.gov/Archives/edgar/data/5272/0001047469-11-004647-index.htm,2017-10-11 23:45:15,2017-10-11 23:45:20,2011-05-05,2011-03-31,https://www.sec.gov/Archives/edgar/data/5272/000104746911004647/a2203832z10-q.htm

The first part of the code:

# Generate the list of index files archived in EDGAR since start_year (earliest: 1993) until the most recent quarter
import datetime

current_year = datetime.date.today().year
current_quarter = (datetime.date.today().month - 1) // 3 + 1
start_year = 1993
years = list(range(start_year, current_year))
quarters = ['QTR1', 'QTR2', 'QTR3', 'QTR4']
history = [(y, q) for y in years for q in quarters]
for i in range(1, current_quarter + 1):
    history.append((current_year, 'QTR%d' % i))
urls = ['https://www.sec.gov/Archives/edgar/full-index/%d/%s/crawler.idx' % (x[0], x[1]) for x in history]
urls.sort()

# Download index files and write content into SQLite
import sqlite3
import requests

con = sqlite3.connect('edgar_htm_idx.db')
cur = con.cursor()
cur.execute('DROP TABLE IF EXISTS idx')
cur.execute('CREATE TABLE idx (conm TEXT, type TEXT, cik TEXT, date TEXT, path TEXT)')

for url in urls:
    lines = requests.get(url).text.splitlines()
    nameloc = lines[7].find('Company Name')
    typeloc = lines[7].find('Form Type')
    cikloc = lines[7].find('CIK')
    dateloc = lines[7].find('Date Filed')
    urlloc = lines[7].find('URL')
    records = [tuple([line[:typeloc].strip(), line[typeloc:cikloc].strip(), line[cikloc:dateloc].strip(),
                      line[dateloc:urlloc].strip(), line[urlloc:].strip()]) for line in lines[9:]]
    cur.executemany('INSERT INTO idx VALUES (?, ?, ?, ?, ?)', records)
    print(url, 'downloaded and wrote to SQLite')

con.commit()
con.close()

# Write SQLite database to Stata
import pandas
from sqlalchemy import create_engine

engine = create_engine('sqlite:///edgar_htm_idx.db')
with engine.connect() as conn, conn.begin():
    data = pandas.read_sql_table('idx', conn)
    data.to_stata('edgar_htm_idx.dta')

# Generate the list of index files archived in EDGAR since start_year (earliest: 1993) until the most recent quarter

import datetime

current_year = datetime.date.today().year

current_quarter = (datetime.date.today().month - 1) // 3 + 1

start_year = 1993

years = list(range(start_year, current_year))

quarters = ['QTR1', 'QTR2', 'QTR3', 'QTR4']

history = [(y, q) for y in years for q in quarters]

for i in range(1, current_quarter + 1):

history.append((current_year, 'QTR%d' % i))

urls = ['https://www.sec.gov/Archives/edgar/full-index/%d/%s/crawler.idx' % (x[0], x[1]) for x in history]

urls.sort()

# Download index files and write content into SQLite

import sqlite3

import requests

con = sqlite3.connect('edgar_htm_idx.db')

cur = con.cursor()

cur.execute('DROP TABLE IF EXISTS idx')

cur.execute('CREATE TABLE idx (conm TEXT, type TEXT, cik TEXT, date TEXT, path TEXT)')

for url in urls:

lines = requests.get(url).text.splitlines()

nameloc = lines[7].find('Company Name')

typeloc = lines[7].find('Form Type')

cikloc = lines[7].find('CIK')

dateloc = lines[7].find('Date Filed')

urlloc = lines[7].find('URL')

records = [tuple([line[:typeloc].strip(), line[typeloc:cikloc].strip(), line[cikloc:dateloc].strip(),

line[dateloc:urlloc].strip(), line[urlloc:].strip()]) for line in lines[9:]]

cur.executemany('INSERT INTO idx VALUES (?, ?, ?, ?, ?)', records)

print(url, 'downloaded and wrote to SQLite')

con.commit()

con.close()

# Write SQLite database to Stata

import pandas

from sqlalchemy import create_engine

engine = create_engine('sqlite:///edgar_htm_idx.db')

with engine.connect() as conn, conn.begin():

data = pandas.read_sql_table('idx', conn)

data.to_stata('edgar_htm_idx.dta')

The first part of the code generates a dataset of the complete path information of SEC filings for the selected period (in both SQLite and Stata). Then, you can select a sample based on firm, form type, filing date, etc. and feed a CSV file to the second part of the code. The feeding CSV should look like this:

13780110,5272,AMERICAN INTERNATIONAL GROUP INC,10-Q,05may2014,https://www.sec.gov/Archives/edgar/data/5272/0000005272-14-000007-index.htm
16212215,5272,AMERICAN INTERNATIONAL GROUP INC,10-Q,03nov2016,https://www.sec.gov/Archives/edgar/data/5272/0000005272-16-000052-index.htm
6772655,5272,AMERICAN INTERNATIONAL GROUP INC,10-Q,10may2007,https://www.sec.gov/Archives/edgar/data/5272/0000950123-07-007211-index.htm
5671285,5272,AMERICAN INTERNATIONAL GROUP INC,10-Q,10may2006,https://www.sec.gov/Archives/edgar/data/5272/0000950123-06-006104-index.htm
10831058,5272,AMERICAN INTERNATIONAL GROUP INC,10-Q,05may2011,https://www.sec.gov/Archives/edgar/data/5272/0001047469-11-004647-index.htm

13780110,5272,AMERICAN INTERNATIONAL GROUP INC,10-Q,05may2014,https://www.sec.gov/Archives/edgar/data/5272/0000005272-14-000007-index.htm

16212215,5272,AMERICAN INTERNATIONAL GROUP INC,10-Q,03nov2016,https://www.sec.gov/Archives/edgar/data/5272/0000005272-16-000052-index.htm

6772655,5272,AMERICAN INTERNATIONAL GROUP INC,10-Q,10may2007,https://www.sec.gov/Archives/edgar/data/5272/0000950123-07-007211-index.htm

5671285,5272,AMERICAN INTERNATIONAL GROUP INC,10-Q,10may2006,https://www.sec.gov/Archives/edgar/data/5272/0000950123-06-006104-index.htm

10831058,5272,AMERICAN INTERNATIONAL GROUP INC,10-Q,05may2011,https://www.sec.gov/Archives/edgar/data/5272/0001047469-11-004647-index.htm

The second part of the code:

import csv
import random
import time

from selenium import webdriver

with open('log.csv', 'w', newline='') as log:
    logwriter = csv.writer(log)

    with open('sample.csv', newline='') as infile:
        records = csv.reader(infile)

        for r in records:
            log_row = r.copy()
            print('Start fetching URL to', r[2], r[3], 'filed on', r[4], '...')
            start_time = time.strftime('%Y-%m-%d %H:%M:%S', time.localtime())

            driver = webdriver.Chrome('./chromedriver')

            try:
                driver.get(r[5])
                time.sleep(3 + random.random() * 3)
                filing_date = driver.find_element_by_xpath('//*[@id="formDiv"]/div[2]/div[1]/div[2]').text
                period_of_report = driver.find_element_by_xpath('//*[@id="formDiv"]/div[2]/div[2]/div[2]').text
                form_text = driver.find_element_by_xpath('//*[@id="formDiv"]/div/table/tbody/tr[2]/td[3]/a').text
                form_link = driver.find_element_by_link_text(form_text).get_attribute('href')
                end_time = time.strftime('%Y-%m-%d %H:%M:%S', time.localtime())
                print('Success!', start_time, ' --> ', end_time, '\n')
                log_row = log_row + [start_time, end_time, filing_date, period_of_report, form_link]

            except:
                end_time = time.strftime('%Y-%m-%d %H:%M:%S', time.localtime())
                print('Error!', start_time, ' --> ', end_time, '\n')
                log_row = log_row + [start_time, end_time, 'ERROR!']

            driver.quit()

            logwriter.writerow(log_row)

import csv

import random

import time

from selenium import webdriver

with open('log.csv', 'w', newline='') as log:

logwriter = csv.writer(log)

with open('sample.csv', newline='') as infile:

records = csv.reader(infile)

for r in records:

log_row = r.copy()

print('Start fetching URL to', r[2], r[3], 'filed on', r[4], '...')

start_time = time.strftime('%Y-%m-%d %H:%M:%S', time.localtime())

driver = webdriver.Chrome('./chromedriver')

try:

driver.get(r[5])

time.sleep(3 + random.random() * 3)

filing_date = driver.find_element_by_xpath('//*[@id="formDiv"]/div[2]/div[1]/div[2]').text

period_of_report = driver.find_element_by_xpath('//*[@id="formDiv"]/div[2]/div[2]/div[2]').text

form_text = driver.find_element_by_xpath('//*[@id="formDiv"]/div/table/tbody/tr[2]/td[3]/a').text

form_link = driver.find_element_by_link_text(form_text).get_attribute('href')

end_time = time.strftime('%Y-%m-%d %H:%M:%S', time.localtime())

print('Success!', start_time, ' --> ', end_time, '\n')

log_row = log_row + [start_time, end_time, filing_date, period_of_report, form_link]

except:

end_time = time.strftime('%Y-%m-%d %H:%M:%S', time.localtime())

print('Error!', start_time, ' --> ', end_time, '\n')

log_row = log_row + [start_time, end_time, 'ERROR!']

driver.quit()

logwriter.writerow(log_row)

Please note:

Please use Python 3.x.
Please install all required modules such as Selenium.
The second part of the code uses Selenium. There are other ways to do the job, e.g., using BeautifulSoup.
The second part of the code only output the direct URL to the HTML-format filing; it does not download the filing.

Posted in Python | 48 Comments

Sample code for “outreg” command in Stata

Posted on September 18, 2017 by Kai Chen

outreg is a time-saving and must-have command in Stata. It will generate a ready-for-use results table like this. I’m sure you will see what a relief this can give us.

outreg is not a built-in command and can be installed by issuing the following command:

ssc install outreg

The typical usage of outreg is:

outreg using results, stats(b p) sdec(3) ///
summstat(r2_a \ N) summdec(3,0) summtitles("Adjusted R2" \ "N") ///
starlevels(10 5 1) starloc(1) ///
ctitles("", "Heading" \ "", "Subheading") ///
keep(_cons x1 x2 x3 x4) merge replace

outreg using results, stats(b p) sdec(3) ///

summstat(r2_a \ N) summdec(3,0) summtitles("Adjusted R2" \ "N") ///

starlevels(10 5 1) starloc(1) ///

ctitles("", "Heading" \ "", "Subheading") ///

keep(_cons x1 x2 x3 x4) merge replace

x1, x2, x3 and x4 are independent variables that you want to report estimates for. Sometimes you may not want to report unimportant independent variables. If you include interaction terms in regressions like this:

regress y c.x1##c.x2

By the way, this is the highly recommended way to include interaction terms in regressions. If you want to report estimates for x1, x2, and x1 × x2, please use the following keep option:

keep(_cons x1 x2 c.x1#c.x2)

I list other frequently used options that control the appearance of the results table:

stats: b, se, t, or p. Usually b and the other statistics will be reported.

- - b: coefficient estimates
  - se: standard errors of estimate
  - t: t statistics for the test of b=0
  - p: p value of t statistics

nosubstat: I do not include this option in the sample code. This option will report two selected statistics (usually b and the other one) in two columns, rather than putting one (e.g., t statistics) below the other (e.g., b).

varlabels: I do not include this option in the sample code. This option will report variable labels, rather than variable names in the generated table. Sometimes variable labels is easier to understand than variable names. In this case, you can set labels for independent variables and turn on this option.

nosubstat and varlabels can be placed anywhere after the comma.

The complete help file for this command can be found here.

There is a similar command outreg2 that you can check. But I find outreg is good enough and works really well for me.

Posted in Stata | Leave a comment

Use Python to download data from the DTCC’s Swap Data Repository

Posted on April 23, 2017 by Kai Chen

I helped my friend to download data from the DTCC’s Swap Data Repository. I am not familiar with the data and just use this as a programming practice.

This article gives an introduction to the origin of the data: http://www.dtcc.com/news/2013/january/03/swap-data-repository-real-time

The Python script will:

download the daily Credit zip files; and
extract CSV from individual zip files and combine the content into a single huge CSV (size 220MB), which then can be imported into Stata or other statistical package.

As of April 22, 2016, there were around one million historical records. The data seems available from April 6, 2013 and missing sporadically from then on. The Python script will print the bad dates where the daily data is not available.

import io
import zipfile
from datetime import date

import pandas as pd
import requests

start = date(2013, 1, 1)
end = date.today()

urls = []

for i in range(start.toordinal(), end.toordinal()):
    datestr = date.fromordinal(i).isoformat().replace('-', '_')
    url = ('https://kgc0418-tdw-data2-0.s3.amazonaws.com/slices/CUMULATIVE_CREDITS_' + datestr + '.zip',
           'CUMULATIVE_CREDITS_' + datestr + '.zip')
    urls.append(url)

badurls = []

df = pd.DataFrame()

for url in urls:
    request = requests.get(url[0])
    if not zipfile.is_zipfile(io.BytesIO(request.content)):
        print(url[1], 'is non-existent!')
        badurls.append(url)
    else:
        with open(url[1], 'wb') as f:
            f.write(request.content)
        print(url[1], 'downloaded!')
        z = zipfile.ZipFile(io.BytesIO(request.content))
        df_ = pd.read_csv(z.open(z.namelist()[0]))
        df_['DATE'] = url[1][19:29]
        df = df.append(df_, ignore_index=True)

df.to_csv('dtcc.csv')

print(badurls)

import io

import zipfile

from datetime import date

import pandas as pd

import requests

start = date(2013, 1, 1)

end = date.today()

urls = []

for i in range(start.toordinal(), end.toordinal()):

datestr = date.fromordinal(i).isoformat().replace('-', '_')

url = ('https://kgc0418-tdw-data2-0.s3.amazonaws.com/slices/CUMULATIVE_CREDITS_' + datestr + '.zip',

'CUMULATIVE_CREDITS_' + datestr + '.zip')

urls.append(url)

badurls = []

df = pd.DataFrame()

for url in urls:

request = requests.get(url[0])

if not zipfile.is_zipfile(io.BytesIO(request.content)):

print(url[1], 'is non-existent!')

badurls.append(url)

else:

with open(url[1], 'wb') as f:

f.write(request.content)

print(url[1], 'downloaded!')

z = zipfile.ZipFile(io.BytesIO(request.content))

df_ = pd.read_csv(z.open(z.namelist()[0]))

df_['DATE'] = url[1][19:29]

df = df.append(df_, ignore_index=True)

df.to_csv('dtcc.csv')

print(badurls)

Posted in Data, Python | 1 Comment

Download FR Y-9C data from WRDS

Posted on April 23, 2017 by Kai Chen

WRDS currently populates FR Y-9C data quarter by quarter in individual datasets, like BHCF200803, BHCF200806, BHCF200809 and so on. WRDS has not stacked those individual datasets to formulate a single time-series dataset like COMPUSTAT.

There are two ways to overcome this:

Use the web query on WRDS. The web query allows users to specify a date range and return a single time-series dataset.
I wrote a SAS script which is equivalent to the web query but with more ease for future update. The code currently can accept a date range and download selected variables.

 %let wrds = wrds-cloud.wharton.upenn.edu 4016; 
 options comamid=TCP remote=wrds;
 signon username = _prompt_;

libname local "D:\";

rsubmit;

 %let start = 2008Q1;                         /*define start quarter      */ 
 %let end = 2009Q4;                           /*define end quarter        */ 
 %let var = rssd9001, rssd9999, bhck2170;     /*define selected variables */

 %macro date_loop(start,end);
 /*converts the dates to SAS dates*/
 %let start=%sysfunc(inputn(&start,anydtdte9.));
 %let end=%sysfunc(inputn(&end,anydtdte9.));
 /*determines the number of quarters between the two dates*/
 %let dif=%sysfunc(intck(quarter,&start,&end));
 %do i=0 %to &dif;
 /*advances the date i quarters from the start date and applies the yymmn. format*/
 %let date=%sysfunc(putn(%sysfunc(intnx(quarter,&start,&i,e)),yymmn.));
 bank.bhcf&date
 %end;
 %mend;
 
data tsbhc;
  set %date_loop(&start,&end);
run;

proc sql;
  create table bhcdata as
  select &var from tsbhc;
quit;

proc download data=bhcdata out=local.bhcdata; run;

endrsubmit;
signoff;

%let wrds = wrds-cloud.wharton.upenn.edu 4016;

options comamid=TCP remote=wrds;

signon username = _prompt_;

libname local "D:\";

rsubmit;

%let start = 2008Q1; /*define start quarter */

%let end = 2009Q4; /*define end quarter */

%let var = rssd9001, rssd9999, bhck2170; /*define selected variables */

%macro date_loop(start,end);

/*converts the dates to SAS dates*/

%let start=%sysfunc(inputn(&start,anydtdte9.));

%let end=%sysfunc(inputn(&end,anydtdte9.));

/*determines the number of quarters between the two dates*/

%let dif=%sysfunc(intck(quarter,&start,&end));

%do i=0 %to &dif;

/*advances the date i quarters from the start date and applies the yymmn. format*/

%let date=%sysfunc(putn(%sysfunc(intnx(quarter,&start,&i,e)),yymmn.));

bank.bhcf&date

%end;

%mend;

data tsbhc;

set %date_loop(&start,&end);

run;

proc sql;

create table bhcdata as

select &var from tsbhc;

quit;

proc download data=bhcdata out=local.bhcdata; run;

endrsubmit;

signoff;

Posted in Data, SAS | 3 Comments

Stata command to convert string GVKEY to numerical GVKEY or vice versa

Stata command to calculate the area under ROC curve

Stata commands to calculate skewness

Lawsuit data from Stanford Law School’s Securities Class Action Clearinghouse

Calculate idiosyncratic stock return volatility

Commonly used Stata commands to deal with potential outliers

Use Python to extract URLs to HTML-format SEC filings on EDGAR

Sample code for “outreg” command in Stata

Use Python to download data from the DTCC’s Swap Data Repository

Download FR Y-9C data from WRDS

Categories

Archives

Site Admin