[Update on 2020-06-26] Eduardo has made a significant improvement to the code. Now you can specify a starting date and download the index files for the period from that starting date to the most recent date. I expect it to be very useful for many readers of my website. Eduardo has kindly shared the code in the comments. Thank you, Eduardo!

[Update on 2019-08-07] From time to time, readers have reported that the first-part code seemingly stalls at certain quarters. I don't know the exact reason (perhaps a server-side issue), and I have never encountered the problem myself. I would suggest that you simply try again later. I also share a Dropbox link from which you can download the first-part results (as of 2019-08-07; 2.4GB) in CSV format (link). Please note that, as I explained in my original post, the URL contained in the downloadable CSV is not the URL to the HTML-format filing; it is just the URL to an index page. You need to select your sample and go through the second-part code to get the URL to the HTML-format filing.

[Original post] I wrote two posts to describe how to download TXT-format SEC filings on EDGAR:

Although TXT-format files have the benefit of being easy to process further, they are often poorly formatted and thus hard to read. An HTML-format 10-K is more pleasing to the eye. In fact, the SEC also provides the paths (namely, URLs) to HTML-format filings. With the path, we can open an HTML-format filing in a web browser, or further download the filing as a PDF.

The Python code consists of two parts. In the first part, we need to download the path data. Instead of using master.idx as in the above two posts, we need to use crawler.idx for this task. The path we get will be a URL like this:

https://www.sec.gov/Archives/edgar/data/859747/0001477932-16-007969-index.htm

Note that the path we get is a URL to an index page, not a URL to the HTML-format 10-Q in this example. To get the direct URL to the HTML-format 10-Q, we have to go one level deeper. The second part of the Python code goes that level deeper and extracts the direct URL to the main body of the form (the URL embedded in the first row in more than 99% of cases). The code also extracts information such as the filing date and period of report from the index page. The code writes the output (including the filing date, period of report, and direct URL) to log.csv. The following is an output example: the first URL is the path we get in the first part of the code; the second URL is the direct URL to the HTML-format form.

The first part of the code:
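
The following is a minimal sketch of what the first part does, not the original code itself: it loops over the quarterly crawler.idx files (Company Name | Form Type | CIK | Date Filed | URL), parses each record, and stores the results in SQLite. The starting period, the table and file names, and the User-Agent string are my own choices; only the SQLite output is sketched here, and the Stata file could be written afterwards with pandas' DataFrame.to_stata.

```python
import sqlite3
import requests

# EDGAR asks automated clients to identify themselves in the User-Agent header
HEADERS = {'User-Agent': 'Your Name yourname@example.com'}

def parse_crawler_idx(text):
    """Parse one crawler.idx file into (company, form_type, cik, date_filed, url) tuples."""
    lines = text.splitlines()
    # Locate the fixed-width header line and use its column offsets to slice each record
    header_pos = next(i for i, l in enumerate(lines) if l.startswith('Company Name'))
    starts = [lines[header_pos].index(col) for col in
              ('Company Name', 'Form Type', 'CIK', 'Date Filed', 'URL')]
    records = []
    for line in lines[header_pos + 1:]:
        fields = [line[s:e].strip() for s, e in zip(starts, starts[1:] + [None])]
        if fields[-1].startswith('http'):        # skip the separator line and blanks
            records.append(tuple(fields))
    return records

def download_index(start_year, start_qtr, end_year, end_qtr, db='edgar_idx.db'):
    """Download crawler.idx quarter by quarter and store the records in SQLite."""
    conn = sqlite3.connect(db)
    conn.execute('CREATE TABLE IF NOT EXISTS idx '
                 '(conm TEXT, type TEXT, cik TEXT, date TEXT, path TEXT)')
    for year in range(start_year, end_year + 1):
        for qtr in range(1, 5):
            if (year, qtr) < (start_year, start_qtr) or (year, qtr) > (end_year, end_qtr):
                continue
            url = (f'https://www.sec.gov/Archives/edgar/full-index/'
                   f'{year}/QTR{qtr}/crawler.idx')
            resp = requests.get(url, headers=HEADERS)
            resp.raise_for_status()
            conn.executemany('INSERT INTO idx VALUES (?, ?, ?, ?, ?)',
                             parse_crawler_idx(resp.text))
            conn.commit()
    conn.close()

if __name__ == '__main__':
    download_index(2016, 1, 2016, 4)
```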

The first part of the code generates a dataset with the complete path information of SEC filings for the selected period (in both SQLite and Stata formats). Then you can select a sample based on firm, form type, filing date, etc., and feed a CSV file to the second part of the code. The feeding CSV should look like this:
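
(For illustration only; the exact column layout appears in the original post. Everything below except the index URL quoted earlier is a hypothetical placeholder, and the url column is what the sketch of the second part below expects.)

```csv
conm,type,cik,date,url
<company name>,10-Q,859747,<filing date>,https://www.sec.gov/Archives/edgar/data/859747/0001477932-16-007969-index.htm
```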

The second part of the code:
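
Below is a minimal sketch of the second part, not the original code: it reads the feeding CSV, opens each index page with Selenium, and writes the filing date, period of report, and direct URL to log.csv. It assumes the feeding CSV has a column named url (as in the example above), that the index page shows the filing date and period of report in infoHead/info div pairs, and that the main document is the link in the first data row of the tableFile table; the Chrome driver and the error handling are my own choices.

```python
import csv
from selenium import webdriver
from selenium.webdriver.common.by import By

def extract_from_index_page(driver, index_url):
    """Return (filing_date, period_of_report, direct_url) scraped from one index page."""
    driver.get(index_url)
    # "Filing Date" and "Period of Report" appear as infoHead/info div pairs
    heads = [e.text for e in driver.find_elements(By.CLASS_NAME, 'infoHead')]
    infos = [e.text for e in driver.find_elements(By.CLASS_NAME, 'info')]
    meta = dict(zip(heads, infos))
    # The main body of the form is usually the link in the first data row of the
    # "Document Format Files" table (class "tableFile")
    first_link = driver.find_element(By.CSS_SELECTOR, 'table.tableFile tr:nth-child(2) a')
    return (meta.get('Filing Date', ''),
            meta.get('Period of Report', ''),
            first_link.get_attribute('href'))

def main(sample_csv='sample.csv', out_csv='log.csv'):
    driver = webdriver.Chrome()
    with open(sample_csv, newline='') as fin, open(out_csv, 'w', newline='') as fout:
        writer = csv.writer(fout)
        writer.writerow(['index_url', 'filing_date', 'period_of_report', 'direct_url'])
        for row in csv.DictReader(fin):
            try:
                writer.writerow([row['url'], *extract_from_index_page(driver, row['url'])])
            except Exception as exc:
                writer.writerow([row['url'], '', '', f'ERROR: {exc}'])
    driver.quit()

if __name__ == '__main__':
    main()
```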

Please note:

  1. Please use Python 3.x.
  2. Please install all required modules such as Selenium.
  3. The second part of the code uses Selenium. There are other ways to do the job, e.g., using BeautifulSoup (see the sketch after this list).
  4. The second part of the code only outputs the direct URL to the HTML-format filing; it does not download the filing itself.
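
For completeness, here is a brief sketch of the BeautifulSoup alternative mentioned in note 3. It assumes the same index-page layout as the Selenium sketch above (a tableFile table whose first data row links to the main document); the requests library, the parser choice, and the User-Agent string are my own assumptions, not part of the original code.

```python
import requests
from bs4 import BeautifulSoup

# EDGAR asks automated clients to identify themselves in the User-Agent header
HEADERS = {'User-Agent': 'Your Name yourname@example.com'}

def extract_direct_url(index_url):
    """Return the direct URL to the main document linked from an EDGAR index page."""
    resp = requests.get(index_url, headers=HEADERS)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, 'html.parser')
    table = soup.find('table', class_='tableFile')   # the "Document Format Files" table
    first_row = table.find_all('tr')[1]              # first data row after the header row
    href = first_row.find('a')['href']               # hrefs on the index page are relative
    return 'https://www.sec.gov' + href

print(extract_direct_url(
    'https://www.sec.gov/Archives/edgar/data/859747/0001477932-16-007969-index.htm'))
```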
