Use Python to download TXT-format SEC filings on EDGAR (Part II)

[Update on 2019-07-31] This post, together with its sibling post "Part I", has been my most-viewed post since I created this website. However, the landscape of 10-K/Q filings has changed dramatically over the past decade, and text-format filings are extremely unfriendly to researchers nowadays. I would suggest directing our research efforts to HTML-format filings with the help of BeautifulSoup. The other post deserves more attention.

[Update on 2017-03-03] The SEC closed the FTP server permanently on December 30, 2016 and switched to a more secure transmission protocol, HTTPS. Since then I have received several requests to update the script. Here is the new code for Part II.

[Original Post] As I said in the post entitled "Part I", we have to complete two steps in order to download SEC filings on EDGAR:

  1. Find paths to raw text filings;
  2. Select what we want and bulk download from EDGAR using paths we have obtained in the first step.

"Part I" elaborates on the first step. This post shares the Python code for the second step.

In the first step, I save the index files in a SQLite database as well as a Stata dataset. The index database includes all types of filings (e.g., 10-K and 10-Q). Select from the database the types you want and export your selection into a CSV file, say "sample.csv". To use the following Python code, the CSV file must be formatted as follows (this example selects all 10-Ks of Apple Inc). Please note: both the SQLite and Stata datasets contain an index column, and you have to delete that index column when exporting your selection to a CSV file.
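For illustration, each record of "sample.csv" carries five comma-separated fields (CIK, company name, form type, date filed, path) with no header row. One such line (this particular Apple 10-K path also appears in the comments below) looks like:

```
320193,APPLE INC,10-K,2015-10-28,edgar/data/320193/0001193125-15-356351.txt
```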

Then we can let Python complete the bulk download task:
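A minimal sketch of the download loop (assuming each sample.csv record is CIK, company name, form type, date filed, path, in that order; the post's updated code uses the requests package, swapped here for the standard library's urllib so the sketch is self-contained):

```python
import csv
import re
from urllib.request import urlopen

def safe_name(line):
    """Build the output file name (cik-form type-filing date.txt), stripping
    slashes and backslashes, which are illegal in file names."""
    parts = [re.sub(r'[/\\]', '', p) for p in (line[0], line[2], line[3])]
    return '-'.join(parts) + '.txt'

def download_filings(csv_path='sample.csv'):
    # Each record: CIK, company name, form type, date filed, path.
    with open(csv_path, newline='') as csvfile:
        for line in csv.reader(csvfile, delimiter=','):
            url = 'https://www.sec.gov/Archives/' + line[4].strip()
            with open(safe_name(line), 'wb') as f:
                f.write(urlopen(url).read())
            print(url, 'downloaded and wrote to text file')
```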

The code does not manage the file directories for "sample.csv" or for the output raw text filings; you can modify the paths yourself. saveas = '-'.join([line[0], line[2], line[3]]) is used to name the output SEC filings. The current name is cik-form type-filing date.txt. Rearrange these elements to suit your needs (thanks to Eva for letting me know about a previous error here).

This entry was posted in Data, Python.

59 Responses to Use Python to download TXT-format SEC filings on EDGAR (Part II)

  1. Kostas says:

    Really enjoyed the posts relating to EDGAR.

    Would be really nice if you could do another part focusing on how to clean and prepare SEC filings for textual analysis, or even how to extract particular items from 10-Ks, for example.

  2. Frank V says:

    Awesome Post!

    Is there a quick way to modify the Python script so that it writes the path to the xml infotable for each company? For instance, instead of outputting the following path to the db:

    “edgar/data/949623/0000949623-16-000014.txt”

    The script would output the following:

    “edgar/data/949623/000094962316000014/infotable.xml”

    Thanks again!

  3. David says:

    Do I have to use Stata to compile my CSV file? Or can I use R, SAS, or even Python for that step? Can anyone help, please, as long as I have the paths?

  4. David says:

    Is it possible to compile only one section of each report, for example Item 7A?

  5. CL says:

    Thank you for sharing this knowledge!

    Part 1 I executed without problem, but when running Part 2 I ran into the error message 'No such file or directory: …'. I'm not certain how to troubleshoot this. If I just use the code as is, should I be expecting any sort of problem?

    • jj says:

      Create a csv file called "sample.csv", copy and paste the names of the files you are interested in (from the first code screen), and save it in the same folder as your Python file.

  6. Thomas says:

    Excellent post! Thank you for sharing this information, it truly is invaluable.

    • William L says:

      Hi Kai,

      thank you very much for sharing these codes. Unfortunately, the Part 2 code doesn't work with my Python 2.7.13:

      I get the following error message:
      File "E:\William\Download_Edgar.py", line 10, in <module>
      with open('edgar_idx_sort_final.csv', newline='') as csvfile:
      TypeError: 'newline' is an invalid keyword argument for this function

      If I successfully circumvent the above error, I get the following error messages:
      Traceback (most recent call last):
      File "E:\William\Download_Edgar.py", line 18, in <module>
      ftp.retrbinary('RETR %s' % path, f.write)
      File "E:\William\Python\lib\ftplib.py", line 414, in retrbinary
      conn = self.transfercmd(cmd, rest)
      File "E:\William\Python\lib\ftplib.py", line 376, in transfercmd
      return self.ntransfercmd(cmd, rest)[0]
      File "E:\William\Python\lib\ftplib.py", line 339, in ntransfercmd
      resp = self.sendcmd(cmd)
      File "E:\William\Python\lib\ftplib.py", line 249, in sendcmd
      return self.getresp()
      File "E:\William\Python\lib\ftplib.py", line 224, in getresp
      raise error_perm, resp
      error_perm: 550 path: No such file or directory

      Would be great if you can give me a hint for the second error messages. Thank you and happy new year!

      Best
      William

  7. Greg says:

    Do you know how this works with the SEC turning off the FTP services? Thanks so much for the help!

  8. Liyan says:

    Hi Kai, thanks so much for sharing these great codes. I am a Python newbie trying to modify your code to suit the new SEC website after the FTP server shutdown. I have been trying to replace ftplib with either httplib2 or urllib2, but it is not exactly working out; specifically, I am having difficulties saving the response as a file. Would you be kind enough to help out? Thank you!

    from __future__ import print_function

    import csv
    import httplib2

    h = httplib2.Http('www.sec.gov')

    with open('sample.csv', newline='') as csvfile:
        reader = csv.reader(csvfile, delimiter=',')
        for line in reader:
            saveas = '-'.join([line[0], line[1], line[2], line[3]])
            # Reorganize to rename the output filename.
            path = line[4].strip()
            with open(saveas, 'wb') as f:

  9. Claire says:

    Thanks for the code! Is there anything wrong with 2011 Q4? I get the files for every quarter except that one.

  10. Eva says:

    Hi Chen,
    I used your Python 3 code to download SEC 10-K files. However, the following error showed up:
    ---------------------------------------------------------------------------
    FileNotFoundError Traceback (most recent call last)
    in ()
    8 # Reorganize to rename the output filename.
    9 url = 'https://www.sec.gov/Archives/' + line[4].strip()
    ---> 10 with open(saveas, 'wb') as f:
    11 f.write(requests.get('%s' % url).content)
    12 print(url, 'downloaded and wrote to text file')

    FileNotFoundError: [Errno 2] No such file or directory: '147-101320-UJB FINANCIAL CORP /NJ/-10-K'

    This is after successful download of a couple of txt files. Is it possibly because the url has changed or updated?

    • Kai Chen says:

      Thanks for letting me know the bug. This is because the company name contains special characters that are not allowed in a text file name. In this code line: saveas = '-'.join([line[0], line[1], line[2], line[3]]), line[0] represents the first element in one line/record of the CSV file (in your case it is 147); line[1] represents the second element (in your case, 101320), and so on. To solve the problem, remove the company name element (in your case, line[2]). I would suggest you change the code line to: saveas = '-'.join([line[0], line[1], line[3], line[4]]).

  11. João Lago says:

    Dear Kai,

    First, thank you very much for your codes. I have been trying to use your updated script with Python 3.0. However, I get the following error when I run the line "import request":
    "ImportError: No module named request"
    I tried installing the package but I am having some difficulties with it. Would you know how to solve this issue for Python 3.0?

    In terms of the information that I need to extract, I just need the date of the Filing, the CUSIP of the targeted firm and the Percentage of ownership recorded (if this helps on solving the issue).

    Thank you very much for your work.

    Best regards,

    João Lago

    • Kai Chen says:

      The Requests module documentation or Google would be a better resource for troubleshooting. If you want to extract specific information from text filings, my current posts cannot help you with that. To achieve it, you should learn something called "regular expressions". Data collection is a huge cost for research. If you only need CUSIP and percentage of ownership (what ownership?), are you sure extracting them from raw text filings is the best way to go?

  12. Christian Ramos says:

    Dear Kai,

    I had a similar problem to Eva's, but I couldn't fix it myself. Below is a summary of the code and the error:

    import csv
    import requests

    with open('sample.csv', newline='') as csvfile:
        reader = csv.reader(csvfile, delimiter=',')
        for line in reader:
            saveas = '-'.join([line[0], line[2], line[3]])
            url = 'https://www.sec.gov/Archives/' + line[4].strip()
            with open(saveas, 'wb') as f:
                f.write(requests.get('%s' % url).content)
            print(url, 'downloaded and wrote to text file')

    5199
    https://www.sec.gov/Archives/edgar/data/1000045/0001436857-15-000014.txt downloaded and wrote to text file
    Traceback (most recent call last):
    File "", line 6, in <module>
    with open(saveas, 'wb') as f:
    FileNotFoundError: [Errno 2] No such file or directory: '1000045-SC 13D/A-2015-04-24'

    As you can see, the first filing was extracted, but the next one wasn't. How can I solve this?

    Thank you very much.

    Kind regards,

    Christian Ramos

    • Kai Chen says:

      Hi Christian, this is because the forward slash is not allowed in the file name. I have updated my script and made it more error-proof. Please use the new script.

      • Christian Ramos says:

        Dear Kai,

        Thank you for your reply. The script works perfectly fine !

        Do you have any advice on how to compile this data into treatable excel file ?

        Thank you very much for your work.

        Kind regards,

        Christian Ramos

        • Kai Chen says:

          I am not sure what you want to do exactly. If you want to extract specific texts from those filings, that's not something my current posts can help you with. But to do that, you can turn to something called "regular expressions". Regular expressions are independent of programming language. There is a misconception that only Perl can do text extraction. That's dead wrong. In my opinion, people will forget Perl once they appreciate the simplicity and readability of Python. Many programming languages, including Python, support regular expressions pretty well nowadays. In accounting and finance research, many textual analysis tutorials use Perl not because Perl is way better, but simply because Perl is older. That being said, writing good regular expression patterns is a work of art. I would suggest borrowing regular expression patterns from those Perl tutorials and then asking Python to do the rest.
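A toy sketch of this idea (the sample text is made up; a real filing needs a far more robust pattern): pulling a single section, such as the Item 7A that David asked about above, out of a filing's plain text with Python's re module.

```python
import re

# Made-up stand-in for a filing's plain text.
filing_text = """Item 7A. Quantitative and Qualitative Disclosures About Market Risk
The Company's interest rate risk discussion goes here.
Item 8. Financial Statements and Supplementary Data"""

# Non-greedy match between the Item 7A and Item 8 headings; re.S lets '.'
# cross line breaks.
match = re.search(r'Item\s+7A\.(.*?)Item\s+8\.', filing_text, flags=re.S)
section = match.group(1).strip() if match else ''
print(section)
```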

  13. Tiago Ferreira says:

    Hello,
    Thank you so much for the post.

    I also need to download the 20-F filings.
    Any ideas on how to do that?

    Thanks

  14. Mostak Ahamed says:

    Hi Kai,
    Thanks for the Stata edgar_idx.dta.
    I am not proficient in Python. Could I fetch the text files from the SEC using Stata with the paths in the index?

    thanks

  15. Bei says:

    Hi Kai,

    Thank you so much for sharing the code! While I am fetching the file from SEC, I get the warning message. “HTTPError: HTTP Error 429: Too Many Requests”. Then I tried to browse SEC website, and it gave me the following message.

    “You’ve Exceeded the SEC’s Traffic Limit

    Your request rate has exceeded the SEC’s maximum allowable requests per second. Your access to SEC.gov will be limited for 10 minutes.

    Current guidelines limit each user to a total of no more than 10 requests per second, regardless of the number of machines used to submit requests. To ensure that SEC.gov remains available to all users, we reserve the right to block IP addresses that submit excessive requests.”

    Could I know whether you encounter this problem before? Do you have any suggestions to solve it? Thanks!

    • Kai Chen says:

      In this case, you have to pace your program, i.e., add pauses to slow it down so that it does not run over the limit. You can import the time module and use, e.g., time.sleep(25), which instructs the program to pause for 25 seconds.
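A sketch of this pacing idea (download_one and the URLs here are placeholders, not part of the original script):

```python
import time

def paced_download(urls, delay=0.15):
    """Fetch URLs one at a time, pausing between requests to stay under the
    SEC's 10-requests-per-second limit (0.15 s between calls is roughly 6-7
    requests per second)."""
    fetched = []
    for url in urls:
        # fetched.append(download_one(url))  # your real download call goes here
        fetched.append(url)                  # placeholder so the sketch runs
        time.sleep(delay)
    return fetched

# Hypothetical usage with placeholder URLs:
done = paced_download(['url-1', 'url-2', 'url-3'])
```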

  16. Rob says:

    Kai,

    I cannot get the newline operator to work properly. Thus, I incur an IndexError as my indexes are out of range. Could there be something wrong with my csv file?

    • Kai Chen says:

      I don't know exactly what went wrong on your side. I'm using a Mac. If you're on a Windows machine, note that Windows and Mac use different newline characters; maybe that's the reason. Please check the Python documentation.

  17. Jim says:

    Hi Kai,

    Excellent article, thank you for posting it! I have followed your instructions and created my sample.csv with the same info as your example above. When I try to download the files, I get an error for each one. Here is an example:

    Start fetching URL to 10-K 10/28/15 filed on edgar/data/320193/0001193125-15-356351.txt …
    Error! 2018-01-29 20:11:34 –> 2018-01-29 20:11:36

    Could you please assist?

  18. Philip Bastiansen says:

    Hi Kai, once again – fantastic guide with very simple code that makes it possible for someone like me to follow each of the steps. I have encountered a problem that has also been raised in a different comment, but the answer doesn't seem to apply to my context. I get the following:

    Traceback (most recent call last):
    File "C:/Users/PPB92/PycharmProjects/Projekt/Main.py", line 10, in <module>
    with open(saveas, 'wb') as f:
    FileNotFoundError: [Errno 2] No such file or directory: '1005274-TECHNOLOGY SERVICE GROUP INC \\DE\\-10-K-1996-06-25'

    As far as I can see, the problem is that there are backslashes, which are not allowed. However, I have tried to play around with "replace" and I can't seem to replace a \ with a '-'. Is there any way to make it an "and" or an "or" argument, so that it replaces either a / or a \? Any help would be greatly appreciated 🙂

    • Kai Chen says:

      Hi Philip, the error may be caused by the difference between Windows and MacOS. You gave me a good suggestion. I have updated the code and hope it is more error-proof regardless of OS.
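A sketch of how the updated script handles this, using a re.sub character class that covers both slash types (the input list below replays the record from Philip's traceback):

```python
import re

def clean(part):
    # Drop forward and back slashes, which both Windows and macOS treat as
    # path separators and disallow inside file names.
    return re.sub(r'[/\\]', '', part)

# The record from Philip's traceback, with both kinds of slashes removed:
saveas = '-'.join(clean(p) for p in
                  ['1005274', 'TECHNOLOGY SERVICE GROUP INC \\DE\\', '10-K', '1996-06-25'])
print(saveas)
```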

  19. Philip says:

    Hi again Kai!

    Thank you for the last advice regarding the Mac -> Windows interpretation of the non-allowable characters! I have run into an issue regarding a "timeout error", and wanted your input:

    TimeoutError: [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond

    Is this a connection error on my end, or the SEC's? It doesn't seem like the issue mentioned previously by @Bei regarding an SEC timeout.

  20. stu says:

    Hi,
    First of all, a million thanks for this site. It is a life saver.
    I'm new to Python and I'm using PyCharm. I ran this code verbatim after Part 1:
    import csv
    import requests
    import re

    with open('sample.csv', newline='') as csvfile:
        reader = csv.reader(csvfile, delimiter=',')
        for line in reader:
            fn1 = line[0]
            fn2 = re.sub(r'[/\\]', '', line[1])
            fn3 = re.sub(r'[/\\]', '', line[2])
            fn4 = line[3]
            saveas = '-'.join([fn1, fn2, fn3, fn4])
            # Reorganize to rename the output filename.
            url = 'https://www.sec.gov/Archives/' + line[4].strip()
            with open(saveas, 'wb') as f:
                f.write(requests.get('%s' % url).content)
            print(url, 'downloaded and wrote to text file')

    but I received this error:
    File "C:/Users/stu/Desktop/untitled4/alt2.py", line 5, in <module>
    with open('sample.csv', newline='') as csvfile:
    FileNotFoundError: [Errno 2] No such file or directory: 'sample.csv'

  21. Vr says:

    Can you please let me know if the script extracts sections of a 10-K filing too? Thank you.

  22. Naser says:

    Hi Kai

    Thank you for posting the code to download SEC filings. I am new to Python. When running Part II, I get an error message after the following command:

    >>> reader = csv.reader(csvfile, delimiter=',')
    Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
    NameError: name 'csvfile' is not defined

    Your help is highly appreciated.

    • Kai Chen says:

      "with open('sample.csv', newline='') as csvfile" assigns sample.csv to the file object csvfile. You got the error probably because you didn't place a sample.csv in the working directory.

  23. Yang says:

    Traceback (most recent call last):
    File "D:\PY4e\edgar2.py", line 15, in <module>
    with open(saveas, 'wb') as f:
    FileNotFoundError: [Errno 2] No such file or directory: '71691-NEW YORK TIMES CO-CORRESP-12/12/2018'
    [Finished in 0.369s]

    It seems Python cannot remember the file name.

    • Kai Chen says:

      In sample.csv, change the date format from 12/12/2018 to anything without a back or forward slash, e.g., 2018-12-12 or 20181212, and the code should go through. Slashes are interpreted as directory structure by Windows, so they cannot appear in file names.
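The same fix can be made inside the script instead of editing the CSV; a one-line sketch:

```python
# Reformat a slash-delimited filing date before using it in a file name.
date_filed = '12/12/2018'
safe_date = date_filed.replace('/', '-')
print(safe_date)
```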

  24. Yang says:

    After hours of testing, I finally figured out the issue with your Python script. Unfortunately, your code will not create multiple files to save the financial statements. Indeed, the code works, but it only saves the last file in sample.csv.
    You may add something like the following to make it work. For instance:
    with open(tempOutfilename, 'r') as f:
        for anotherOneline in f:
            if reporttimeprefix in anotherOneline:
                reportTime = anotherOneline.replace(reporttimeprefix, '').strip()
            if companynameprefix in anotherOneline:
                companyName = anotherOneline.replace(companynameprefix, '').strip()
    outputFileName = datapath + '/' + companyName + '_' + reportTime + '.txt'

    • Kai Chen says:

      I did test my code before and it saved each file without issues. If you can only save the last one, the only reason I can imagine for now is that you indented "with open … as f" incorrectly, so that it lay outside the loop and only took the last url in.

  25. Grga says:

    Whatever I did, it does not work for me. If someone could please be so kind as to copy-paste the latest working code. Many thanks in advance. P.S. I am a noob.

    • Grga says:

      Working now, my bad. However, any idea how to extract the income statement from a 10-K or 10-Q? Any tip is appreciated :)

  26. Jerry says:

    Thanks for the code. I was having problems transferring the data from SQLite to CSV, so I ended up loading the data into a dataframe instead.

    import sqlite3
    import pandas as pd

    con = sqlite3.connect('edgar_idx.db')
    df = pd.read_sql_query("select * from idx;", con)
    con.close()

    Does this produce the same data?
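Jerry's dataframe route reads the same idx table, and it can be pushed one step further to write sample.csv directly (a sketch; the column names 'cik' and 'type' are assumed from the Part I schema, and index=False drops the pandas index column, which would otherwise shift the positions of line[0]..line[4] in the download script):

```python
import sqlite3
import pandas as pd

def export_sample(db='edgar_idx.db', cik='320193', form='10-K'):
    """Select one company's filings of a given type from the Part I index
    database and write sample.csv with neither an index column nor a header
    row (the download script expects bare records)."""
    con = sqlite3.connect(db)
    # Column names 'cik' and 'type' are assumed from the Part I schema.
    df = pd.read_sql_query(
        "SELECT * FROM idx WHERE type = ? AND cik = ?;", con, params=(form, cik))
    con.close()
    df.to_csv('sample.csv', index=False, header=False)
    return df
```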

  27. Karolina says:

    Hi Kai!
    I am doing a project in which I have to extract Section 1A Risk Factors from 10-K reports. I followed your instructions, downloaded 13,000 files, and I can open them in Word; they are perfectly readable there. But if I open them in a text editor or try to use BeautifulSoup to extract only the text data, it seems like 90% of the text is encoded in XBRL and I have no idea how to decode it into a string. I uploaded a sample file here: https://gofile.io/?c=SKrlf6 . Could you please have a look? Maybe you know how to decode it? I am lost 🙁

    • Kai Chen says:

      Thanks for raising the question and letting me know the percentage of XBRL-style txt filings. The landscape of 10-K/Q filings has changed dramatically over the past decade (txt -> html -> html + xbrl -> ixbrl). Today's txt-format 10-K/Q is totally different from 20 years ago. The parsing methods you have seen in the literature may not work anymore. I would suggest you forget the txt format and start with the html format with the help of BeautifulSoup.

  28. Victor says:

    Kai is right that it is easier using the html format with BeautifulSoup. BeautifulSoup has a one liner that extracts all the text from a html page.

    Since you have downloaded the txt file, you can also use BeautifulSoup to extract text from the txt file. The txt file actually includes the html file, all the exhibits, and the xbrl attachments. The html file is contained between

    <DOCUMENT>
    <TYPE>10-K

    and the first </DOCUMENT> in the txt file.

    That is, the first document in the txt file is the html file, i.e., the main body of the 10-K filing. If you copy <DOCUMENT><TYPE>10-K …. </DOCUMENT> to a new txt file in NotePad, save it as txt, and then change the extension to "htm" or "html" and open it with Chrome or IE, you will find that it is exactly the html file.

    So you can try something like this. I haven’t tested it, but I have a feeling that it should work.

    (1) Open the txt file as a string
    (2) Extract the first DOCUMENT, something like <DOCUMENT>.*?</DOCUMENT>, as a string
    (3) Convert the DOCUMENT into a SOUP object.
    Then, you can extract all the text using the one liner.

    Good luck!

    • Victor says:

      Did not realize that some tags are not shown. I’m going to use “( )” instead of angular brackets, because angular brackets are used to indicate htm tags and are disregarded by the browser.

      The body of the 10-K:

      (DOCUMENT)
      (TYPE)10-K



      (/DOCUMENT)

      The following regex matches it: r”(Document).*?(/Document)”

      Remember to change () to angular brackets, and set the flag to re.S or re.DOTALL, or put (?s) before the regex. The first match is usually the main body of the 10-K, and is the same as that in the html file.

      You can also be more specific with the regex by including the (TYPE)10-K also. Note that 10-K has other variants such as 10-K405. This should work for all:

      (Document)\n(TYPE)10.*?(/Document)

      Remember to change () to angular brackets.
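Victor's three steps can be sketched in Python. The string below is a toy stand-in for a real full-text filing, and the BeautifulSoup step is left commented out so the sketch needs only the standard library:

```python
import re

# Toy stand-in for a full-text EDGAR .txt filing: the first <DOCUMENT> is the
# main 10-K body, later <DOCUMENT>s are exhibits and attachments.
txt = """<SEC-DOCUMENT>
<DOCUMENT>
<TYPE>10-K
<TEXT><html><body><p>Main body of the 10-K.</p></body></html></TEXT>
</DOCUMENT>
<DOCUMENT>
<TYPE>EX-21
<TEXT>Subsidiaries exhibit.</TEXT>
</DOCUMENT>
"""

# Non-greedy match of the first DOCUMENT whose TYPE starts with 10
# (covers 10-K and variants such as 10-K405); re.S lets '.' span lines.
match = re.search(r'<DOCUMENT>\s*<TYPE>10.*?</DOCUMENT>', txt, flags=re.S)
body = match.group(0) if match else ''

# With BeautifulSoup installed, the plain text then comes out in one line:
# from bs4 import BeautifulSoup
# text = BeautifulSoup(body, 'html.parser').get_text()
```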
