{"id":59,"date":"2015-02-28T03:13:58","date_gmt":"2015-02-28T08:13:58","guid":{"rendered":"https:\/\/kchen032.wordpress.com\/?p=59"},"modified":"2019-09-04T09:20:22","modified_gmt":"2019-09-04T13:20:22","slug":"use-python-to-get-edgar-index-files","status":"publish","type":"post","link":"https:\/\/www.kaichen.work\/?p=59","title":{"rendered":"Use Python to download TXT-format SEC filings on EDGAR (Part I)"},"content":{"rendered":"<p><strong>[Update on 2019-07-31]<\/strong>\u00a0This post, together with its sibling post \u201c<a href=\"http:\/\/kaichen.work\/?p=681\" target=\"_blank\" rel=\"noopener noreferrer\">Part II<\/a>\u201d, has been my most-viewed post since I created this website. However, the landscape of 10-K\/Q filings has changed dramatically over the past decade, and the text-format filings are extremely unfriendly for researchers nowadays. I would suggest directing our research efforts to HTML-format filings with the help of BeautifulSoup. The other\u00a0<a href=\"http:\/\/kaichen.work\/?p=946\" target=\"_blank\" rel=\"noopener noreferrer\">post<\/a>\u00a0deserves more attention.<\/p>\n<p><strong>[Update on 2018-10-06]<\/strong>\u00a0As I acknowledged in the very first edition of this post, I\u00a0borrowed some code from Edouard Swiac&#8217;s Python module &#8220;python-edgar&#8221; (version: 1.0). Edouard kindly informed me that he had updated his module (see his GitHub <a href=\"https:\/\/github.com\/edouardswiac\/python-edgar\" target=\"_blank\" rel=\"noopener noreferrer\">page<\/a>). The major updates to his module include: (1) he migrated the file download from FTP to HTTPS, and (2) he added parallel downloads, so it is now faster to rebuild the full index, especially if going all the way back to 1993. My initial thoughts about his updated module are that it provides more flexibility and should be more robust than mine. 
Thank you Edouard for your work!<\/p>\n<p><strong>[Update on 2017-03-03]<\/strong> SEC closed the FTP server permanently on December 30, 2016 and started to use a more secure transmission protocol\u2014HTTPS. So the description of the FTP server in the original post is not applicable any more (but the basic idea about the URLs to raw text filings remains unchanged). Since then I have received several requests to update the script. Here is the new script for Part I.<\/p>\n<pre class=\"lang:python decode:true \"># Generate the list of index files archived in EDGAR since start_year (earliest: 1993) until the most recent quarter\r\nimport datetime\r\n\r\n# Please download index files chunk by chunk. For example, first download index files during 1993\u20132000, then\r\n# download index files during 2001\u20132005 by changing the following two lines, and so on. If you need index\r\n# files up to the most recent year and quarter, comment out the following three lines, remove the comment sign at\r\n# the start of the next three lines, and define the start_year that immediately follows the ending year of the\r\n# previous chunk.\r\n\r\nstart_year = 2011       # change start_year and end_year to re-define the chunk\r\ncurrent_year = 2015     # change start_year and end_year to re-define the chunk\r\ncurrent_quarter = 4     # do not change this line\r\n\r\n# start_year = 2016     # only change this line to download the most recent chunk\r\n# current_year = datetime.date.today().year\r\n# current_quarter = (datetime.date.today().month - 1) \/\/ 3 + 1\r\n\r\nyears = list(range(start_year, current_year))\r\nquarters = ['QTR1', 'QTR2', 'QTR3', 'QTR4']\r\nhistory = [(y, q) for y in years for q in quarters]\r\nfor i in range(1, current_quarter + 1):\r\n    history.append((current_year, 'QTR%d' % i))\r\nurls = ['https:\/\/www.sec.gov\/Archives\/edgar\/full-index\/%d\/%s\/master.idx' % (x[0], x[1]) for x in history]\r\nurls.sort()\r\n\r\n# Download index 
files and write content into SQLite\r\nimport sqlite3\r\nimport requests\r\n\r\ncon = sqlite3.connect('edgar_idx.db')\r\ncur = con.cursor()\r\ncur.execute('DROP TABLE IF EXISTS idx')\r\ncur.execute('CREATE TABLE idx (cik TEXT, conm TEXT, type TEXT, date TEXT, path TEXT)')\r\n\r\nfor url in urls:\r\n    lines = requests.get(url).content.decode(\"utf-8\", \"ignore\").splitlines()\r\n    records = [tuple(line.split('|')) for line in lines[11:]]\r\n    cur.executemany('INSERT INTO idx VALUES (?, ?, ?, ?, ?)', records)\r\n    print(url, 'downloaded and written to SQLite')\r\n\r\ncon.commit()\r\ncon.close()\r\n\r\n# Write SQLite database to Stata\r\nimport pandas\r\nfrom sqlalchemy import create_engine\r\n\r\nengine = create_engine('sqlite:\/\/\/edgar_idx.db')\r\nwith engine.connect() as conn, conn.begin():\r\n    data = pandas.read_sql_table('idx', conn)\r\n    data.to_stata('edgar_idx.dta', version=117)<\/pre>\n<p>The technical details may be too boring for most people, so I provide multiple downloadable Stata datasets that include all index files from 1993 Q1 to March 16, 2019.<\/p>\n<p>Stata format (<a href=\"https:\/\/www.dropbox.com\/s\/ab88fb7tobpzetd\/edgar_idx_1993_2000.dta?dl=0\" target=\"_blank\" rel=\"noopener noreferrer\">1993\u20132000<\/a>); Stata format (<a href=\"https:\/\/www.dropbox.com\/s\/j49dgbpzgbiwkqr\/edgar_idx_2001_2005.dta?dl=0\" target=\"_blank\" rel=\"noopener noreferrer\">2001\u20132005<\/a>); Stata format (<a href=\"https:\/\/www.dropbox.com\/s\/r45saaa4yqi0bbz\/edgar_idx_2006_2010.dta?dl=0\" target=\"_blank\" rel=\"noopener noreferrer\">2006\u20132010<\/a>); Stata format (<a href=\"https:\/\/www.dropbox.com\/s\/gjpsny7szzssk28\/edgar_idx_2011_2015.dta?dl=0\" target=\"_blank\" rel=\"noopener noreferrer\">2011\u20132015<\/a>); Stata format (<a href=\"https:\/\/www.dropbox.com\/s\/rbrrspeajostcrg\/edgar_idx_2016_20190316.dta?dl=0\" target=\"_blank\" rel=\"noopener noreferrer\">2016\u20132019\/03\/16<\/a>)<\/p>\n<p><strong>[Original 
Post]<\/strong>\u00a0We know that SEC makes company filings (e.g., 10-Ks, 10-Qs, and 8-Ks) publicly available on EDGAR. The <a href=\"http:\/\/www.sec.gov\/edgar\/searchedgar\/companysearch.html\" target=\"_blank\" rel=\"noopener noreferrer\">web search interface<\/a>\u00a0is convenient, but we may need to bulk download raw text filings. SEC provides an\u00a0anonymous <a href=\"ftp:\/\/ftp.sec.gov\" target=\"_blank\" rel=\"noopener noreferrer\">EDGAR FTP server<\/a> to access raw text filings. Usually, if we know the path or URL to a file on an FTP server, we can easily use an Internet browser or FTP software to connect to the server and download the file. For example, if we navigate a bit on the EDGAR FTP server, we can find the path to the file &#8220;master.idx&#8221; as follows:<\/p>\n<p>`ftp:\/\/ftp.sec.gov\/edgar\/full-index\/2015\/QTR4\/master.idx`<\/p>\n<p>If we copy the path into an Internet browser or FTP software, we can download the file directly.<\/p>\n<p>In the above example, we can find the path to &#8220;master.idx&#8221; by navigating the EDGAR FTP server. But we\u00a0cannot find the\u00a0path to any raw text filing. In other words, paths\u00a0to raw text filings are\u00a0not visible by simply browsing the EDGAR FTP server. SEC purposely hides paths to raw text filings\u00a0to reduce server load and avoid data abuse.<\/p>\n<p>In order to download SEC\u00a0filings on EDGAR, we have to:<\/p>\n<ol>\n<li>Find paths to raw text filings;<\/li>\n<li>Select what we want and bulk download raw text filings from the EDGAR FTP server using the paths we obtained in the first step.<\/li>\n<\/ol>\n<p>This post describes the first step, and I elaborate on the second step in <a href=\"http:\/\/kaichen.work\/?p=681\" target=\"_blank\" rel=\"noopener noreferrer\">another post<\/a>.<\/p>\n<p>SEC stores all path information in index files. 
See technical details\u00a0<a href=\"http:\/\/www.sec.gov\/edgar\/searchedgar\/ftpusers.htm\" target=\"_blank\" rel=\"noopener noreferrer\">here<\/a>. Let&#8217;s take a snapshot of an index file:<\/p>\n<pre class=\"lang:default decode:true \">Description:           Master Index of EDGAR Dissemination Feed\r\nLast Data Received:    December 31, 2015\r\nComments:              webmaster@sec.gov\r\nAnonymous FTP:         ftp:\/\/ftp.sec.gov\/edgar\/\r\n \r\n \r\n \r\n \r\nCIK|Company Name|Form Type|Date Filed|Filename\r\n--------------------------------------------------------------------------------\r\n1000028|AIM SAFETY CO INC|REVOKED|2015-12-21|edgar\/data\/1000028\/9999999997-15-015914.txt\r\n1000032|BINCH JAMES G|4|2015-12-02|edgar\/data\/1000032\/0001209191-15-082911.txt\r\n1000032|BINCH JAMES G|4|2015-12-11|edgar\/data\/1000032\/0001209191-15-084690.txt\r\n1000045|NICHOLAS FINANCIAL INC|10-Q|2015-11-09|edgar\/data\/1000045\/0001193125-15-371499.txt\r\n1000045|NICHOLAS FINANCIAL INC|8-K|2015-10-29|edgar\/data\/1000045\/0001193125-15-357258.txt\r\n1000045|NICHOLAS FINANCIAL INC|SC 13G\/A|2015-11-03|edgar\/data\/1000045\/0001436857-15-000017.txt\r\n1000097|KINGDON CAPITAL MANAGEMENT, L.L.C.|13F-HR|2015-11-16|edgar\/data\/1000097\/0000919574-15-008246.txt\r\n1000097|KINGDON CAPITAL MANAGEMENT, L.L.C.|SC 13G|2015-12-23|edgar\/data\/1000097\/0000919574-15-008800.txt\r\n1000177|NORDIC AMERICAN TANKERS Ltd|6-K|2015-11-12|edgar\/data\/1000177\/0000919574-15-007734.txt\r\n1000177|NORDIC AMERICAN TANKERS Ltd|6-K|2015-11-12|edgar\/data\/1000177\/0000919574-15-007742.txt\r\n1000177|NORDIC AMERICAN TANKERS Ltd|SC 13D\/A|2015-11-17|edgar\/data\/1000177\/0000919574-15-008279.txt\r\n...\r\n...<\/pre>\n<p>The last field of each line in the main body of the index file is the path to a raw text filing. What we have to do in the first step is to download and parse all index files and write the content into a database. 
Then in the second step, we can run any query against the database (e.g., select a certain form type or a certain period of time) and download raw text filings using the selected paths.<\/p>\n<p>I wrote the following Python program to execute the first step. This program borrows some code from Edouard Swiac&#8217;s Python module &#8220;python-edgar&#8221; (version: 1.0).\u00a0Please\u00a0see his\u00a0package information page <a href=\"https:\/\/pypi.python.org\/pypi\/python-edgar\/1.0\" target=\"_blank\" rel=\"noopener noreferrer\">here<\/a>.<\/p>\n<p>Please note: my program stores all paths in an SQLite database. I personally like this lightweight database very much. The last few lines of my program transfer data from the SQLite database to a Stata dataset for users who are not familiar with SQLite. To do so, I use two Python modules, `pandas` and `sqlalchemy`, which you have to install with the `pip` command on your own. Please google the documentation of SQLite, Pandas, and SQLAlchemy if you have installation problems.\u00a0I am using Python 3.x in all my Python posts.<\/p>\n<pre class=\"lang:python decode:true\">import datetime\r\nimport ftplib\r\nimport sqlite3\r\nimport tempfile\r\nimport zipfile\r\n\r\n# Generate the list of quarterly zip files archived in EDGAR since\r\n# start_year (earliest: 1993) until the most recent quarter\r\ncurrent_year = datetime.date.today().year\r\ncurrent_quarter = (datetime.date.today().month - 1) \/\/ 3 + 1\r\nstart_year = 1993\r\nyears = list(range(start_year, current_year))\r\nquarters = ['QTR1', 'QTR2', 'QTR3', 'QTR4']\r\nhistory = [(y, q) for y in years for q in quarters]\r\nfor i in range(1, current_quarter):\r\n    history.append((current_year, 'QTR%d' % i))\r\nquarterly_files = ['edgar\/full-index\/%d\/%s\/master.zip' % (x[0], x[1]) for x\r\n                   in history]\r\nquarterly_files.sort()\r\n\r\n# Generate the list of daily index files archived in EDGAR for the most\r\n# recent quarter\r\nftp = 
ftplib.FTP('ftp.sec.gov')\r\nftp.login()\r\ndaily_files = [f for f in ftp.nlst('edgar\/daily-index') if\r\n               f.startswith('edgar\/daily-index\/master')]\r\ndaily_files.sort()\r\n\r\n# Download index files and write content into SQLite\r\ncon = sqlite3.connect('edgar_idx.db')\r\ncur = con.cursor()\r\ncur.execute('DROP TABLE IF EXISTS idx')\r\ncur.execute('CREATE TABLE idx (cik TEXT, conm TEXT, type TEXT, date TEXT, '\r\n            'path TEXT)')\r\n\r\nfor file in quarterly_files:\r\n    with tempfile.TemporaryFile() as temp:\r\n        ftp.retrbinary('RETR %s' % file, temp.write)\r\n        with zipfile.ZipFile(temp).open('master.idx') as z:\r\n            for i in range(10):\r\n                z.readline()\r\n            records = [tuple(line.decode('latin-1').rstrip().split('|')) for\r\n                       line in z]\r\n    cur.executemany('INSERT INTO idx VALUES (?, ?, ?, ?, ?)', records)\r\n    print(file, 'downloaded and written to SQLite')\r\n\r\nfor file in daily_files:\r\n    with tempfile.TemporaryFile() as temp:\r\n        ftp.retrbinary('RETR %s' % file, temp.write)\r\n        temp.seek(0)\r\n        for i in range(7):\r\n            temp.readline()\r\n        records = [tuple(line.decode('latin-1').rstrip().split('|')) for\r\n                   line in temp]\r\n    cur.executemany('INSERT INTO idx VALUES (?, ?, ?, ?, ?)', records)\r\n    print(file, 'downloaded and written to SQLite')\r\n\r\ncon.commit()\r\ncon.close()\r\n\r\nftp.close()\r\n\r\n# Write SQLite database to Stata\r\nimport pandas\r\nfrom sqlalchemy import create_engine\r\n\r\nengine = create_engine('sqlite:\/\/\/edgar_idx.db')\r\nwith engine.connect() as conn, conn.begin():\r\n    data = pandas.read_sql_table('idx', conn)\r\n    data.to_stata('edgar_idx.dta')\r\n<\/pre>\n<p>I found two articles that explain how to use R and Perl to achieve the same functionality. 
I include the links (<a href=\"https:\/\/iangow.wordpress.com\/2011\/08\/26\/getting-sec-filing-index-files\/\" target=\"_blank\" rel=\"noopener noreferrer\">R<\/a> or <a href=\"https:\/\/sbaleone.bus.miami.edu\/PERLCOURSE\/Perl_Resources.html\" target=\"_blank\" rel=\"noopener noreferrer\">Perl<\/a>) for users who are\u00a0more comfortable with R or Perl.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>[Update on 2019-07-31]\u00a0This post, together with its sibling post \u201cPart II\u201c, has been my most-viewed post since I created this website. However, the landscape of 10-K\/Q filings has changed dramatically over the past decade, and the text-format filings are extremely &hellip; <a href=\"https:\/\/www.kaichen.work\/?p=59\">Continue reading <span class=\"meta-nav\">&rarr;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[7,10],"tags":[],"_links":{"self":[{"href":"https:\/\/www.kaichen.work\/index.php?rest_route=\/wp\/v2\/posts\/59"}],"collection":[{"href":"https:\/\/www.kaichen.work\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.kaichen.work\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.kaichen.work\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.kaichen.work\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=59"}],"version-history":[{"count":81,"href":"https:\/\/www.kaichen.work\/index.php?rest_route=\/wp\/v2\/posts\/59\/revisions"}],"predecessor-version":[{"id":1437,"href":"https:\/\/www.kaichen.work\/index.php?rest_route=\/wp\/v2\/posts\/59\/revisions\/1437"}],"wp:attachment":[{"href":"https:\/\/www.kaichen.work\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=59"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.kaichen.work\/index.php?rest_route=%2Fwp%
2Fv2%2Fcategories&post=59"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.kaichen.work\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=59"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}