The code in this post is used to calculate Campbell and Taksler’s (2003) idiosyncratic stock return volatility, but it can be easily modified for other definitions.

Specifically, this code requires an input dataset that includes two variables: permno and enddt, where enddt is the date of interest. The code calculates the standard deviation of daily abnormal returns over the 180 calendar days up to and including enddt. Abnormal returns are calculated using four methods: (1) market-adjusted; (2) the standard market model; (3) the Fama-French three factors; and (4) the Fama-French three factors plus momentum. The code requires at least 21 return observations (roughly one month of trading days) over that 180-day period for a permno before it calculates the stock return volatility.
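The original code is not reproduced in this post, but the logic of the simplest case, the market-adjusted method, can be sketched in pandas as follows. The column names (permno, date, ret, mkt_ret, enddt) and the use of pandas are my own illustration rather than the original implementation:

import pandas as pd

def idio_vol(events, daily):
    """events: one row per (permno, enddt); daily: CRSP daily stock returns
    already merged with the value-weighted market return."""
    out = []
    for permno, enddt in events[["permno", "enddt"]].itertuples(index=False):
        # 180 calendar days up to and including enddt
        window = daily[(daily["permno"] == permno)
                       & (daily["date"] > enddt - pd.Timedelta(days=180))
                       & (daily["date"] <= enddt)]
        abret = window["ret"] - window["mkt_ret"]            # market-adjusted abnormal return
        vol = abret.std() if abret.count() >= 21 else None   # require at least 21 daily returns
        out.append({"permno": permno, "enddt": enddt, "vol_mktadj": vol})
    return pd.DataFrame(out)

The other three methods replace the simple subtraction of the market return with the residuals from the corresponding factor regression estimated over the same 180-day window.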

Commonly used Stata commands to deal with potential outliers

In accounting archival research, we often take it for granted that we must do something to deal with potential outliers before we run a regression. The commonly used methods are: truncate, winsorize, studentized residuals, and Cook’s distance. I discuss in this post which Stata command to use to implement these four methods.

First of all, why and how we deal with potential outliers is perhaps one of the messiest issues that accounting researchers will encounter, because no one ever gives a definitive and satisfactory answer. In my opinion, only outliers resulting from apparent data errors should be deleted from the sample. That said, this post is not going to answer that messy question; instead, the purpose of this post is to summarize the Stata commands for commonly used methods of dealing with outliers (even if we are not sure whether these methods are appropriate—we all know that is true in accounting research!). Let’s start.

Truncate and winsorize

In my opinion, the best Stata commands for truncating and winsorizing are truncateJ and winsorizeJ, written by Judson Caskey. I will not spend time explaining why; I simply highly recommend his work. Please see his website here.

After the installation, you can type help truncateJ or help winsorizeJ to learn how to use these two commands.

Studentized residuals

The first step is to run a regression without specifying any vce parameter in Stata (i.e., without robust or clustered standard errors). Suppose the dependent variable is y and the independent variables are x1 and x2. The first step looks like this:

regress y x1 x2

Then, use the predict command:

predict rstu if e(sample), rstudent

If the absolute value of rstu exceeds a certain critical value, the data point is considered an outlier and deleted from the final sample. Stata’s manual indicates that “studentized residuals can be interpreted as the t statistic for testing the significance of a dummy variable equal to 1 in the observation in question and 0 elsewhere. Such a dummy variable would effectively absorb the observation and so remove its influence in determining the other coefficients in the model.” To be honest, I do not fully understand this explanation, but since rstu is a t statistic, the critical value for a conventional significance level should be applied, for example, 1.96 (or 2) for the 5% level. That is why in the literature we often see that data points with absolute studentized residuals greater than 2 are deleted. Some papers use a critical value of 3, which corresponds to a 0.27% significance level and seems less reasonable to me.

Now use the following command to drop “outliers” based on the critical value of 2:

drop if abs(rstu) > 2 & !missing(rstu)

The !missing() condition keeps Stata from also dropping observations outside the estimation sample, for which rstu is missing (missing values are treated as arbitrarily large in comparisons).

The last step is to re-run the regression, but this time we can add appropriate vce parameters to address additional issues such as heteroskedasticity:

regress y x1 x2, vce(robust)

or

regress y x1 x2, vce(cl gvkey)

Cook’s distance

This method is similar to studentized residuals. We predict another diagnostic statistic, namely Cook’s distance, and then delete any data points with a Cook’s distance greater than 4/N, where N is the number of observations used in the regression (Cook’s distance is always positive).

regress y x1 x2

predict cooksd if e(sample), cooksd

drop if cooksd > 4/e(N) & !missing(cooksd)

Here e(N) is the number of observations stored by the preceding regress command, so 4/e(N) implements the 4/N cutoff.

Next, re-run the regression with appropriate vce parameters:

regress y x1 x2, vce(robust)

or

regress y x1 x2, vce(cl gvkey)
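As an aside, readers who work in Python rather than Stata can reproduce both screens with statsmodels. This is my own sketch on simulated data, not part of the original post:

import numpy as np
import pandas as pd
import statsmodels.api as sm

# Simulated data standing in for the real sample
df = pd.DataFrame(np.random.randn(200, 3), columns=["y", "x1", "x2"])

res = sm.OLS(df["y"], sm.add_constant(df[["x1", "x2"]])).fit()
infl = res.get_influence()

rstu = infl.resid_studentized_external       # studentized residuals
cooksd = infl.cooks_distance[0]              # Cook's distance

sample_rstu = df[np.abs(rstu) <= 2]          # keep |studentized residual| <= 2
sample_cooksd = df[cooksd <= 4 / res.nobs]   # keep Cook's distance <= 4/N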

Lastly, I thank the authors of the articles from which I have benefited.

Although TXT-format files have the benefit of being easy to process further, they are often not well formatted and thus hard to read. An HTML-format 10-K is much easier on the eyes. SEC also provides the paths (namely, URLs) to HTML-format filings. With such a path, we can open an HTML-format filing in a web browser or download the filing as a PDF.

The Python code has two parts. In the first part, we download the path data. Instead of using master.idx as in the above two posts, we use crawler.idx for this task. The path we get will be a URL like this: https://www.sec.gov/Archives/edgar/data/5272/0000005272-14-000007-index.htm

Note that the path we get is a URL to an index page, not the URL of the HTML-format 10-Q in this example. To get the direct URL to the HTML-format 10-Q, we have to go one level deeper. The second part of the Python code goes that level deeper and extracts the direct URL to the main body of the form (the URL embedded in the first row in more than 99% of cases). The code also extracts information such as the filing date and period of report from the index page, and writes the output (including filing date, period of report, and direct URL) to log.csv. The following is an output example: the first URL is the path obtained in the first part of the code; the second URL is the direct URL to the HTML-format form.

13780110,5272,AMERICAN INTERNATIONAL GROUP INC,10-Q,05may2014,https://www.sec.gov/Archives/edgar/data/5272/0000005272-14-000007-index.htm,2017-10-11 23:44:42,2017-10-11 23:44:50,2014-05-05,2014-03-31,https://www.sec.gov/Archives/edgar/data/5272/000000527214000007/maindocument001.htm
16212215,5272,AMERICAN INTERNATIONAL GROUP INC,10-Q,03nov2016,https://www.sec.gov/Archives/edgar/data/5272/0000005272-16-000052-index.htm,2017-10-11 23:44:51,2017-10-11 23:44:58,2016-11-03,2016-09-30,https://www.sec.gov/Archives/edgar/data/5272/000000527216000052/maindocument001.htm
6772655,5272,AMERICAN INTERNATIONAL GROUP INC,10-Q,10may2007,https://www.sec.gov/Archives/edgar/data/5272/0000950123-07-007211-index.htm,2017-10-11 23:44:59,2017-10-11 23:45:05,2007-05-10,2007-03-31,https://www.sec.gov/Archives/edgar/data/5272/000095012307007211/y32085e10vq.htm
5671285,5272,AMERICAN INTERNATIONAL GROUP INC,10-Q,10may2006,https://www.sec.gov/Archives/edgar/data/5272/0000950123-06-006104-index.htm,2017-10-11 23:45:07,2017-10-11 23:45:14,2006-05-10,2006-03-31,https://www.sec.gov/Archives/edgar/data/5272/000095012306006104/y19465e10vq.htm
10831058,5272,AMERICAN INTERNATIONAL GROUP INC,10-Q,05may2011,https://www.sec.gov/Archives/edgar/data/5272/0001047469-11-004647-index.htm,2017-10-11 23:45:15,2017-10-11 23:45:20,2011-05-05,2011-03-31,https://www.sec.gov/Archives/edgar/data/5272/000104746911004647/a2203832z10-q.htm

The first part of the code generates a dataset with the complete path information of SEC filings for the selected period (in both SQLite and Stata formats). You can then select a sample based on firm, form type, filing date, etc., and feed a CSV file to the second part of the code (a sketch of that second part follows the example below). The feeding CSV should look like this:

13780110,5272,AMERICAN INTERNATIONAL GROUP INC,10-Q,05may2014,https://www.sec.gov/Archives/edgar/data/5272/0000005272-14-000007-index.htm
16212215,5272,AMERICAN INTERNATIONAL GROUP INC,10-Q,03nov2016,https://www.sec.gov/Archives/edgar/data/5272/0000005272-16-000052-index.htm
6772655,5272,AMERICAN INTERNATIONAL GROUP INC,10-Q,10may2007,https://www.sec.gov/Archives/edgar/data/5272/0000950123-07-007211-index.htm
5671285,5272,AMERICAN INTERNATIONAL GROUP INC,10-Q,10may2006,https://www.sec.gov/Archives/edgar/data/5272/0000950123-06-006104-index.htm
10831058,5272,AMERICAN INTERNATIONAL GROUP INC,10-Q,05may2011,https://www.sec.gov/Archives/edgar/data/5272/0001047469-11-004647-index.htm
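To illustrate the second part, here is a rough sketch of the “one level deeper” step: fetch the index page and return the first document link, which is the main body of the form in the vast majority of cases. The use of requests and BeautifulSoup and the link-selection rule are my assumptions; the actual code may parse the page differently:

import requests
from bs4 import BeautifulSoup

def direct_url(index_url):
    headers = {"User-Agent": "research yourname@example.com"}   # identify yourself to EDGAR
    soup = BeautifulSoup(requests.get(index_url, headers=headers).text, "html.parser")
    # Return the first link that points into the filing's document folder
    for a in soup.find_all("a", href=True):
        href = a["href"]
        if "/Archives/edgar/data/" in href and href.endswith(".htm") and "-index" not in href:
            return "https://www.sec.gov" + href
    return None

print(direct_url("https://www.sec.gov/Archives/edgar/data/5272/0000005272-14-000007-index.htm"))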

Use Python to download data from the DTCC’s Swap Data Repository

I helped a friend download data from the DTCC’s Swap Data Repository. I am not familiar with the data and treated this simply as a programming exercise.

The script extracts the CSV from each individual zip file and combines the contents into a single large CSV (about 220 MB), which can then be imported into Stata or another statistical package.

As of April 22, 2016, there were around one million historical records. The data appears to be available from April 6, 2013, with sporadic gaps since then. The Python script will print the bad dates for which daily data is not available.
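Without reproducing the script, the basic approach can be sketched as follows. The URL pattern is a placeholder rather than the real DTCC address, and the date format and file layout are assumptions:

import csv, io, zipfile, datetime
import requests

BASE_URL = "https://example.com/slices/CUMULATIVE_{date}.zip"   # hypothetical URL pattern

start, end = datetime.date(2013, 4, 6), datetime.date(2016, 4, 22)
with open("combined.csv", "w", newline="") as out:
    writer = csv.writer(out)
    day = start
    while day <= end:
        url = BASE_URL.format(date=day.strftime("%Y_%m_%d"))
        r = requests.get(url)
        if r.status_code != 200:
            print("bad date:", day)            # daily data not available
        else:
            with zipfile.ZipFile(io.BytesIO(r.content)) as z:
                name = z.namelist()[0]         # each daily zip holds a single CSV
                for row in csv.reader(io.TextIOWrapper(z.open(name))):
                    writer.writerow(row)
        day += datetime.timedelta(days=1)

A real script would also skip the repeated header row in each daily CSV and pause between requests.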

Good luck to everyone who tries to publish a paper in The Accounting Review!!!

PS: I have lost my love for MathType. It drives me crazy by converting my equations into uneditable images over and over again. I have started using Word’s built-in Equation Editor, but Microsoft apparently cannot make the font look right. Install the STIX math font if you are as picky as I am. STIX is a math font that makes equations in Word look a lot like Times New Roman. Just google “STIX math font”.

Use Python to download TXT-format SEC filings on EDGAR (Part II)

As I said in the post entitled “Part I”, we have to complete two steps in order to download SEC filings on EDGAR:

Find paths to raw text filings;

Select the filings we want and bulk-download them from the EDGAR FTP server using the paths obtained in the first step.

“Part I” elaborates on the first step. This post shares the Python code for the second step.

In the first step, I save index files in a SQLite database as well as a Stata dataset. The index database includes all types of filings (e.g., 10-K and 10-Q). Select from the database the types that you want and export your selections into a CSV file, say “sample.csv”. To use the following Python codes, the format of the CSV file must look like this (this example selects all 10-Ks of Apple Inc):

The code does not handle the file directories for “sample.csv” or for the output raw text filings; you can modify those yourself. saveas = '-'.join([line[0], line[2], line[3]]) is used to name the downloaded SEC filings. The current naming scheme is cik-formtype-filingdate.txt. Please rearrange these elements to suit your needs (thanks to Eva for letting me know about a previous error here).

Update on March 3, 2017: The SEC closed the FTP server permanently on December 30, 2016 and switched to a more secure transmission protocol, HTTPS. Since then I have received several requests to update the script. Here is the new code for Part II.
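As a rough illustration of the second step over HTTPS, the core of the download loop might look like the sketch below. It assumes that sample.csv stores the CIK, company name, form type, filing date, and EDGAR path in that column order; adjust the indices if your export differs:

import csv
import requests

with open("sample.csv", newline="") as f:
    for line in csv.reader(f):
        url = "https://www.sec.gov/Archives/" + line[4]               # path column from the index
        saveas = "-".join([line[0], line[2], line[3]]) + ".txt"       # cik-formtype-filingdate.txt
        r = requests.get(url, headers={"User-Agent": "research yourname@example.com"})
        with open(saveas, "wb") as out:
            out.write(r.content)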

Factiva (as well as LexisNexis Academic) is a comprehensive repository of newspapers, magazines, and other news articles. I first describe the data elements of a Factiva news article. Then I explain the steps to extract those data elements and write them into a more machine-readable table using Python.

Data Elements in Factiva Article

Each news article in Factiva, no matter how it looks, contains a number of data elements. In Factiva’s terminology, those data elements are called Intelligence Indexing Fields. The following table lists the label and name of each data element (or field), along with what it contains:

| Field Label | Field Name | What It Contains |
| --- | --- | --- |
| HD | Headline | Headline |
| CR | Credit Information | Credit Information (Example: Associated Press) |
| WC | Word Count | Number of words in document |
| PD | Publication Date | Publication Date |
| ET | Publication Time | Publication Time |
| SN | Source Name | Source Name |
| SC | Source Code | Source Code |
| ED | Edition | Edition of publication (Example: Final) |
| PG | Page | Page on which article appeared (Note: Page-One Story is a Dow Jones Intelligent Indexing™ term) |
| LA | Language | Language in which the document is written |
| CY | Copyright | Copyright |
| LP | Lead Paragraph | First two paragraphs of an article |
| TD | Text | Text following the lead paragraphs |
| CT | Contact | Contact name to obtain additional information |
| RF | Reference | Notes associated with a document |
| CO | Dow Jones Ticker Symbol | Dow Jones Ticker Symbol |
| IN | Industry Code | Dow Jones Intelligent Indexing™ Industry Code |
| NS | Subject Code | Dow Jones Intelligent Indexing™ Subject Code |
| RE | Region Code | Dow Jones Intelligent Indexing™ Region Code |
| IPC | Information Provider Code | Information Provider Code |
| IPD | Information Provider Descriptors | Information Provider Descriptors |
| PUB | Publisher Name | Publisher of information |
| AN | Accession Number | Unique Factiva.com identification number assigned to each document |

Please note that not every news article contains all those data elements, and that the table may not list all data elements used by Factiva (Factiva may make updates). Depending on which display option you select when downloading news articles from Factiva, you may not be able to see certain data elements. But they are there and used by Factiva to organize and structure its proprietary news article data.

How to Extract Data Elements in Factiva Article

You can follow the three steps outlined below to extract the data elements from news articles for further processing (e.g., calculating the tone of the full text, represented by the LP and TD elements together, or grouping articles by news subject, i.e., by the NS element). I explain the steps one by one as follows.

Step 1: Download Articles from Factiva in RTF Format

Downloading a large number of news articles from Factiva is a lot of pain: it is technically difficult to download articles in an automated fashion; you can only download 100 articles at a time; and those 100 articles cannot exceed the word-count limit of 180,000. As a result, gathering tens of thousands of news articles requires a lot of tedious work. While I can do nothing about either issue in this post, I can say a bit more about them.

Firstly, you may see people discuss methods for automatic downloading (a so-called web-scraping technique; see here). However, this requires more hacking now that Factiva has introduced CAPTCHA to determine whether the user is a human. You may not be familiar with the term “CAPTCHA”, but you have surely encountered the situation where you are asked to type the characters or numbers shown in an image before you can download a file or go to the next webpage. That is a CAPTCHA. Both Factiva and LexisNexis Academic have introduced CAPTCHAs to prevent robotic downloading. Though CAPTCHAs are not unbeatable, defeating them requires advanced techniques.

Secondly, the Factiva licence expressly prohibits data mining. However, the licence does not clearly define what constitutes data mining. I was informed that downloading a large number of articles in a short period of time would be red-flagged as data mining. The threshold speed set by Factiva is low, though, and any trained and adept person can easily exceed it. If you are red-flagged by Factiva, things could get ugly. So do not go too fast, even if this slows down your research.

Let’s get back to the topic. When you manually download news articles from Factiva, the most important thing is to select the right display option: choose the third option, Full Article/Report plus Indexing.

Then download the articles in RTF – Article Format.

After the download is completed, you will get an RTF document containing the news articles.

The next step is to convert the RTF to plain TXT, because Python can process TXT documents more easily. After Python finishes its job, the final product will be a table: each row of the table represents a news article, and each column is a data element.

Step 2: Convert RTF to TXT

This can surely be done in Python, but so far I have not written a Python program for it; I will fill this hole when I have time. For my research, I simply take advantage of TextEdit, the default text editor shipped with macOS: select Format – Make Plain Text from the menu bar, then save the document in TXT format. You can also automate this step with Automator on macOS.
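One way to fill that hole without writing a full RTF parser is to drive macOS’s built-in textutil tool from Python. This is only a sketch of that idea, not the missing program:

import glob
import subprocess

for rtf in glob.glob("*.rtf"):
    # textutil ships with macOS and writes file.txt next to file.rtf
    subprocess.run(["textutil", "-convert", "txt", rtf], check=True)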

Step 3: Extract Data Elements and Save to a Table

This is where Python does the dirty work. To run the Python program correctly, save it in the directory containing all the plain TXT documents created in Step 2 before you run it. The program will:

Read in each TXT document;

Extract data elements of each article and write them to an SQLite database;

Export data to a CSV file for easy processing in other software such as Stata.

I introduce an intermediate step that writes the data to an SQLite database simply because this facilitates manipulating the news article data with Python for other purposes. Of course, you can write the data directly to a CSV file.
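The program itself is not reproduced here, but its core logic can be sketched as follows. The sketch assumes one article per TXT document and matches a field wherever a label starts a line, both of which are simplifications of what the real Factiva documents require:

import csv, glob, re, sqlite3

FIELDS = ["HD", "CR", "WC", "PD", "ET", "SN", "SC", "ED", "PG", "LA", "CY", "LP",
          "TD", "CT", "RF", "CO", "IN", "NS", "RE", "IPC", "IPD", "PUB", "AN"]

conn = sqlite3.connect("factiva.db")
conn.execute("CREATE TABLE IF NOT EXISTS articles (%s)"
             % ", ".join('"%s" TEXT' % f for f in FIELDS))   # quote names such as IN

label = re.compile(r"^(%s)\s" % "|".join(FIELDS))
for path in glob.glob("*.txt"):
    record, current = {}, None
    with open(path, encoding="utf-8", errors="ignore") as f:
        for line in f:
            m = label.match(line)
            if m:                          # a new field starts on this line
                current = m.group(1)
                record[current] = line[m.end():].strip()
            elif current:                  # continuation of the current field
                record[current] = (record[current] + " " + line.strip()).strip()
    conn.execute("INSERT INTO articles VALUES (%s)" % ", ".join("?" * len(FIELDS)),
                 [record.get(f) for f in FIELDS])
conn.commit()

with open("factiva.csv", "w", newline="") as out:   # export for Stata or other software
    writer = csv.writer(out)
    writer.writerow(FIELDS)
    writer.writerows(conn.execute("SELECT * FROM articles"))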