{"id":681,"date":"2016-04-09T20:09:03","date_gmt":"2016-04-10T00:09:03","guid":{"rendered":"http:\/\/www.kaikaichen.com\/?p=681"},"modified":"2023-02-08T15:05:53","modified_gmt":"2023-02-08T20:05:53","slug":"use-python-to-download-sec-filings-on-edgar-part-ii","status":"publish","type":"post","link":"https:\/\/www.kaichen.work\/?p=681","title":{"rendered":"Use Python to download TXT-format SEC filings on EDGAR (Part II)"},"content":{"rendered":"<p><strong>[Update on 2019-07-31]<\/strong> This post, together with its sibling post &#8220;<a href=\"http:\/\/www.kaichen.work\/?p=59\" target=\"_blank\" rel=\"noopener noreferrer\">Part I<\/a>&#8221;, has been my most-viewed post since I created this website. However, the landscape of 10-K\/Q filings has changed dramatically over the past decade, and the text-format filings are extremely unfriendly to researchers nowadays. I would suggest directing our research efforts to HTML-format filings with the help of BeautifulSoup. The other <a href=\"http:\/\/kaichen.work\/?p=946\" target=\"_blank\" rel=\"noopener noreferrer\">post<\/a> deserves more attention.<\/p>\n<p><strong>[Update on 2017-03-03]<\/strong> The SEC closed the FTP server permanently on December 30, 2016 and switched to a more secure transmission protocol, HTTPS. Since then I have received several requests to update the script. 
Here are the new codes for Part II. Note that EDGAR now requires a declared User-Agent header with your contact information; replace the placeholder in the codes with your own name and email.<\/p>\n<pre class=\"\">import csv\r\nimport requests\r\nimport re\r\n\r\n# EDGAR requires a declared User-Agent with contact information;\r\n# replace the placeholder with your own name and email.\r\nheaders = {'User-Agent': 'Your Name yourname@example.com'}\r\n\r\nwith open('sample.csv', newline='') as csvfile:\r\n    reader = csv.reader(csvfile, delimiter=',')\r\n    for line in reader:\r\n        fn1 = line[0]\r\n        # Strip slashes and backslashes so the pieces form a valid filename.\r\n        fn2 = re.sub(r'[\/\\\\]', '', line[1])\r\n        fn3 = re.sub(r'[\/\\\\]', '', line[2])\r\n        fn4 = line[3]\r\n        # Reorganize these elements to rename the output file.\r\n        saveas = '-'.join([fn1, fn2, fn3, fn4])\r\n        url = 'https:\/\/www.sec.gov\/Archives\/' + line[4].strip()\r\n        with open(saveas, 'wb') as f:\r\n            f.write(requests.get(url, headers=headers).content)\r\n            print(url, 'downloaded and written to file')<\/pre>\n<p><strong>[Original Post]<\/strong> As I said in the post entitled &#8220;<a href=\"http:\/\/www.kaichen.work\/?p=59\" target=\"_blank\" rel=\"noopener noreferrer\">Part I<\/a>&#8221;, we have to complete two steps in order to download SEC filings on EDGAR:<\/p>\n<ol>\n<li>Find the paths to the raw text filings;<\/li>\n<li>Select what we want and bulk download it from EDGAR using the paths obtained in the first step.<\/li>\n<\/ol>\n<p>&#8220;<a href=\"http:\/\/www.kaichen.work\/?p=59\" target=\"_blank\" rel=\"noopener noreferrer\">Part I<\/a>&#8221; elaborates on the first step. This post shares the Python codes for the second step.<\/p>\n<p>In the first step, I save the index files in a SQLite database as well as a Stata dataset. The index database includes all types of filings (e.g., 10-K and 10-Q). Select from the database the filing types you want and export your selection into a CSV file, say &#8220;sample.csv&#8221;. To use the following Python codes, the CSV file must be formatted as follows (this example selects all 10-Ks of Apple Inc). 
Please note: both SQLite and Stata datasets contain an index column, and you have to delete that index column when exporting your selection into a CSV file.<\/p>\n<pre class=\"lang:default decode:true\">320193,APPLE COMPUTER INC,10-K,1994-12-13,edgar\/data\/320193\/0000320193-94-000016.txt\r\n320193,APPLE COMPUTER INC,10-K,1995-12-19,edgar\/data\/320193\/0000320193-95-000016.txt\r\n320193,APPLE COMPUTER INC,10-K,1996-12-19,edgar\/data\/320193\/0000320193-96-000023.txt\r\n320193,APPLE COMPUTER INC,10-K,1997-12-05,edgar\/data\/320193\/0001047469-97-006960.txt\r\n320193,APPLE COMPUTER INC,10-K,1999-12-22,edgar\/data\/320193\/0000912057-99-010244.txt\r\n320193,APPLE COMPUTER INC,10-K,2000-12-14,edgar\/data\/320193\/0000912057-00-053623.txt\r\n320193,APPLE COMPUTER INC,10-K,2002-12-19,edgar\/data\/320193\/0001047469-02-007674.txt\r\n320193,APPLE COMPUTER INC,10-K,2003-12-19,edgar\/data\/320193\/0001047469-03-041604.txt\r\n320193,APPLE COMPUTER INC,10-K,2004-12-03,edgar\/data\/320193\/0001047469-04-035975.txt\r\n320193,APPLE COMPUTER INC,10-K,2005-12-01,edgar\/data\/320193\/0001104659-05-058421.txt\r\n320193,APPLE COMPUTER INC,10-K,2006-12-29,edgar\/data\/320193\/0001104659-06-084288.txt\r\n320193,APPLE INC,10-K,2007-11-15,edgar\/data\/320193\/0001047469-07-009340.txt\r\n320193,APPLE INC,10-K,2008-11-05,edgar\/data\/320193\/0001193125-08-224958.txt\r\n320193,APPLE INC,10-K,2009-10-27,edgar\/data\/320193\/0001193125-09-214859.txt\r\n320193,APPLE INC,10-K,2010-10-27,edgar\/data\/320193\/0001193125-10-238044.txt\r\n320193,APPLE INC,10-K,2011-10-26,edgar\/data\/320193\/0001193125-11-282113.txt\r\n320193,APPLE INC,10-K,2012-10-31,edgar\/data\/320193\/0001193125-12-444068.txt\r\n320193,APPLE INC,10-K,2013-10-30,edgar\/data\/320193\/0001193125-13-416534.txt\r\n320193,APPLE INC,10-K,2014-10-27,edgar\/data\/320193\/0001193125-14-383437.txt\r\n320193,APPLE INC,10-K,2015-10-28,edgar\/data\/320193\/0001193125-15-356351.txt<\/pre>\n<p>Then we can let Python complete the bulk download 
task:<\/p>\n<pre class=\"lang:python decode:true\"># Note: the SEC closed its FTP server in December 2016, so these\r\n# original codes no longer work; see the updated codes above.\r\nimport csv\r\nimport ftplib\r\n\r\nftp = ftplib.FTP('ftp.sec.gov')\r\nftp.login()\r\n\r\nwith open('sample.csv', newline='') as csvfile:\r\n    reader = csv.reader(csvfile, delimiter=',')\r\n    for line in reader:\r\n        # Reorganize these elements to rename the output file.\r\n        saveas = '-'.join([line[0], line[2], line[3]])\r\n        path = line[4].strip()\r\n        with open(saveas, 'wb') as f:\r\n            ftp.retrbinary('RETR %s' % path, f.write)\r\n\r\nftp.close()<\/pre>\n<p>The codes do not handle the directories of &#8220;sample.csv&#8221; or of the output raw text filings; you can modify them yourself. `saveas = &#8216;-&#8217;.join([line[0], line[2], line[3]])` names the output SEC filings; the current name is `cik-form type-filing date`. Please move these elements around to accommodate your needs (thanks to Eva for pointing out a previous error here).<\/p>\n","protected":false},"excerpt":{"rendered":"<p>[Update on 2019-07-31] This post, together with its sibling post &#8220;Part I&#8221;, has been my most-viewed post since I created this website. 
However, the landscape of 10-K\/Q filings has changed dramatically over the past decade, and the text-format filings are &hellip; <a href=\"https:\/\/www.kaichen.work\/?p=681\">Continue reading <span class=\"meta-nav\">&rarr;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[7,10],"tags":[],"_links":{"self":[{"href":"https:\/\/www.kaichen.work\/index.php?rest_route=\/wp\/v2\/posts\/681"}],"collection":[{"href":"https:\/\/www.kaichen.work\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.kaichen.work\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.kaichen.work\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.kaichen.work\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=681"}],"version-history":[{"count":38,"href":"https:\/\/www.kaichen.work\/index.php?rest_route=\/wp\/v2\/posts\/681\/revisions"}],"predecessor-version":[{"id":1629,"href":"https:\/\/www.kaichen.work\/index.php?rest_route=\/wp\/v2\/posts\/681\/revisions\/1629"}],"wp:attachment":[{"href":"https:\/\/www.kaichen.work\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=681"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.kaichen.work\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=681"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.kaichen.work\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=681"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}