Use Python to download TXT-format SEC filings on EDGAR (Part II)

Posted on April 9, 2016 by Kai Chen

[Update on 2019-07-31] This post, together with its sibling post “Part I“, has been my most-viewed post since I created this website. However, the landscape of 10-K/Q filings has changed dramatically over the past decade, and the text-format filings are extremely unfriendly for researchers nowadays. I would suggest directing our research efforts to html-format filings with the help of BeautifulSoup. The other post deserves more attention.

[Update on 2017-03-03] SEC closed the FTP server permanently on December 30, 2016 and started to use a more secure transmission protocol—https. Since then I have received several requests to update the script. Here it is the new codes for Part II.

import csv
import requests
import re

with open('sample.csv', newline='') as csvfile:
    reader = csv.reader(csvfile, delimiter=',')
    for line in reader:
        fn1 = line[0]
        fn2 = re.sub(r'[/\\]', '', line[1])
        fn3 = re.sub(r'[/\\]', '', line[2])
        fn4 = line[3]
        saveas = '-'.join([fn1, fn2, fn3, fn4])
        # Reorganize to rename the output filename.
        url = 'https://www.sec.gov/Archives/' + line[4].strip()
        with open(saveas, 'wb') as f:
            f.write(requests.get('%s' % url).content)
            print(url, 'downloaded and wrote to text file')

import csv

import requests

import re

with open('sample.csv', newline='') as csvfile:

reader = csv.reader(csvfile, delimiter=',')

for line in reader:

fn1 = line[0]

fn2 = re.sub(r'[/\\]', '', line[1])

fn3 = re.sub(r'[/\\]', '', line[2])

fn4 = line[3]

saveas = '-'.join([fn1, fn2, fn3, fn4])

# Reorganize to rename the output filename.

url = 'https://www.sec.gov/Archives/' + line[4].strip()

with open(saveas, 'wb') as f:

f.write(requests.get('%s' % url).content)

print(url, 'downloaded and wrote to text file')

[Original Post] As I said in the post entitled “Part I“, we have to do two steps in order to download SEC filings on EDGAR:

Find paths to raw text filings;
Select what we want and bulk download from EDGAR using paths we have obtained in the first step.

“Part I” elaborates the first step. This post shares Python codes for the second step.

In the first step, I save index files in a SQLite database as well as a Stata dataset. The index database includes all types of filings (e.g., 10-K and 10-Q). Select from the database the types that you want and export your selection into a CSV file, say “sample.csv”. To use the following Python codes, the format of the CSV file must look as follows (this example selects all 10-Ks of Apple Inc). Please note: both SQLite and Stata datasets contain an index column, and you have to delete that index column when exporting your selection into a CSV file.

320193,APPLE COMPUTER INC,10-K,1994-12-13,edgar/data/320193/0000320193-94-000016.txt
320193,APPLE COMPUTER INC,10-K,1995-12-19,edgar/data/320193/0000320193-95-000016.txt
320193,APPLE COMPUTER INC,10-K,1996-12-19,edgar/data/320193/0000320193-96-000023.txt
320193,APPLE COMPUTER INC,10-K,1997-12-05,edgar/data/320193/0001047469-97-006960.txt
320193,APPLE COMPUTER INC,10-K,1999-12-22,edgar/data/320193/0000912057-99-010244.txt
320193,APPLE COMPUTER INC,10-K,2000-12-14,edgar/data/320193/0000912057-00-053623.txt
320193,APPLE COMPUTER INC,10-K,2002-12-19,edgar/data/320193/0001047469-02-007674.txt
320193,APPLE COMPUTER INC,10-K,2003-12-19,edgar/data/320193/0001047469-03-041604.txt
320193,APPLE COMPUTER INC,10-K,2004-12-03,edgar/data/320193/0001047469-04-035975.txt
320193,APPLE COMPUTER INC,10-K,2005-12-01,edgar/data/320193/0001104659-05-058421.txt
320193,APPLE COMPUTER INC,10-K,2006-12-29,edgar/data/320193/0001104659-06-084288.txt
320193,APPLE INC,10-K,2007-11-15,edgar/data/320193/0001047469-07-009340.txt
320193,APPLE INC,10-K,2008-11-05,edgar/data/320193/0001193125-08-224958.txt
320193,APPLE INC,10-K,2009-10-27,edgar/data/320193/0001193125-09-214859.txt
320193,APPLE INC,10-K,2010-10-27,edgar/data/320193/0001193125-10-238044.txt
320193,APPLE INC,10-K,2011-10-26,edgar/data/320193/0001193125-11-282113.txt
320193,APPLE INC,10-K,2012-10-31,edgar/data/320193/0001193125-12-444068.txt
320193,APPLE INC,10-K,2013-10-30,edgar/data/320193/0001193125-13-416534.txt
320193,APPLE INC,10-K,2014-10-27,edgar/data/320193/0001193125-14-383437.txt
320193,APPLE INC,10-K,2015-10-28,edgar/data/320193/0001193125-15-356351.txt

320193,APPLE COMPUTER INC,10-K,1994-12-13,edgar/data/320193/0000320193-94-000016.txt

320193,APPLE COMPUTER INC,10-K,1995-12-19,edgar/data/320193/0000320193-95-000016.txt

320193,APPLE COMPUTER INC,10-K,1996-12-19,edgar/data/320193/0000320193-96-000023.txt

320193,APPLE COMPUTER INC,10-K,1997-12-05,edgar/data/320193/0001047469-97-006960.txt

320193,APPLE COMPUTER INC,10-K,1999-12-22,edgar/data/320193/0000912057-99-010244.txt

320193,APPLE COMPUTER INC,10-K,2000-12-14,edgar/data/320193/0000912057-00-053623.txt

320193,APPLE COMPUTER INC,10-K,2002-12-19,edgar/data/320193/0001047469-02-007674.txt

320193,APPLE COMPUTER INC,10-K,2003-12-19,edgar/data/320193/0001047469-03-041604.txt

320193,APPLE COMPUTER INC,10-K,2004-12-03,edgar/data/320193/0001047469-04-035975.txt

320193,APPLE COMPUTER INC,10-K,2005-12-01,edgar/data/320193/0001104659-05-058421.txt

320193,APPLE COMPUTER INC,10-K,2006-12-29,edgar/data/320193/0001104659-06-084288.txt

320193,APPLE INC,10-K,2007-11-15,edgar/data/320193/0001047469-07-009340.txt

320193,APPLE INC,10-K,2008-11-05,edgar/data/320193/0001193125-08-224958.txt

320193,APPLE INC,10-K,2009-10-27,edgar/data/320193/0001193125-09-214859.txt

320193,APPLE INC,10-K,2010-10-27,edgar/data/320193/0001193125-10-238044.txt

320193,APPLE INC,10-K,2011-10-26,edgar/data/320193/0001193125-11-282113.txt

320193,APPLE INC,10-K,2012-10-31,edgar/data/320193/0001193125-12-444068.txt

320193,APPLE INC,10-K,2013-10-30,edgar/data/320193/0001193125-13-416534.txt

320193,APPLE INC,10-K,2014-10-27,edgar/data/320193/0001193125-14-383437.txt

320193,APPLE INC,10-K,2015-10-28,edgar/data/320193/0001193125-15-356351.txt

Then we can let Python complete the bulk download task:

import csv
import ftplib

ftp = ftplib.FTP('ftp.sec.gov')
ftp.login()

with open('sample.csv', newline='') as csvfile:
    reader = csv.reader(csvfile, delimiter=',')
    for line in reader:
        saveas = '-'.join([line[0], line[2], line[3]])
        # Reorganize to rename the output filename.
        path = line[4].strip()
        with open(saveas, 'wb') as f:
            ftp.retrbinary('RETR %s' % path, f.write)

ftp.close()

import csv

import ftplib

ftp = ftplib.FTP('ftp.sec.gov')

ftp.login()

with open('sample.csv', newline='') as csvfile:

reader = csv.reader(csvfile, delimiter=',')

for line in reader:

saveas = '-'.join([line[0], line[2], line[3]])

# Reorganize to rename the output filename.

path = line[4].strip()

with open(saveas, 'wb') as f:

ftp.retrbinary('RETR %s' % path, f.write)

ftp.close()

I do not take care of file directories of “sample.csv” and output raw text filings in the codes. You can modify by yourself. saveas = '-'.join([line[0], line[2], line[3]]) is used to name the output SEC filings. The current name is cik-form type-filing date.txt. Please move around these elements to accommodate your needs (thank Eva for letting me know a previous error here).

This entry was posted in Data, Python. Bookmark the permalink.

59 Responses to Use Python to download TXT-format SEC filings on EDGAR (Part II)

Kostas says:

May 19, 2016 at 9:22 am

Really enjoyed the posts relating to EDGAR.

Would be really nice if you can do another part focusing on how to clean and prepare SEC fillings for textual analysis, or even how to extract particular items from 10K’s for example.

Reply
Frank V says:

August 11, 2016 at 9:38 pm

Awesome Post!

Is there a quick way to modify the python script so that it is writing the path to the xml infotable for each company? For instance, instead of outputting the following to path in the db:

“edgar/data/949623/0000949623-16-000014.txt”

The script would output the following:

“edgar/data/949623/000094962316000014/infotable.xml”

Thanks again!

Reply
- David says:
  
  November 25, 2016 at 8:23 pm
  
  Hey man can you help me with the compiling CSV step?I don’t have stata, is there a way I can use another software? Let me know if you can help, we can help each other.
  
  Reply
  - David says:
    
    November 26, 2016 at 4:17 pm
    
    Got it!
    
    Reply
David says:

November 25, 2016 at 8:22 pm

Do I have to use strata to compile my CSV file? or can I use R, SAS or even python to do that step? Can anyone help plz as long as I have the paths?

Reply
David says:

November 26, 2016 at 4:17 pm

Is it possible to compile only 1 section of each report like example 7A?

Reply
CL says:

November 27, 2016 at 5:27 pm

Thank you for sharing this knowledge!

Part 1 I executed without problem but when running Part 2 I ran into the error message ‘No such file or directory: ‘…”. I’m not certain how to troubleshoot this. If I just use the code as is, should I be expecting any sort of problem?

Reply
- jj says:
  
  November 10, 2017 at 10:51 am
  
  create a csv file called “sample.csv” copy and paste the names of files your are interested in (from first code screen) and save it in the same folder as your python file.
  
  Reply
Thomas says:

December 20, 2016 at 6:26 am

Excellent post! Thank you for sharing this information, it truly is invaluable.

Reply
- William L says:
  
  January 5, 2017 at 12:50 pm
  
  Hi Kai,
  
  thank you very much for sharing these codes. Unfortunately, the part2 code doesnt work for my python 2.7.13:
  
  I get following error messages:
  File “E:\William\Download_Edgar.py”, line 10, in
  with open(‘edgar_idx_sort_final.csv’, newline=”) as csvfile:
  TypeError: ‘newline’ is an invalid keyword argument for this function
  
  If I successfully circumvent the above error: I get following error messages:
  Traceback (most recent call last):
  File “E:\William\Download_Edgar.py”, line 18, in
  http://ftp.retrbinary(‘RETR %s’ % path, f.write)
  File “E:\William\Python\lib\ftplib.py”, line 414, in retrbinary
  conn = self.transfercmd(cmd, rest)
  File “E:\William\Python\lib\ftplib.py”, line 376, in transfercmd
  return self.ntransfercmd(cmd, rest)[0]
  File “E:\William\Python\lib\ftplib.py”, line 339, in ntransfercmd
  resp = self.sendcmd(cmd)
  File “E:\William\Python\lib\ftplib.py”, line 249, in sendcmd
  return self.getresp()
  File “E:\William\Python\lib\ftplib.py”, line 224, in getresp
  raise error_perm, resp
  error_perm: 550 path: No such file or directory
  
  Would be great if you can give me a hint for the second error messages. Thank you and happy new year!
  
  Best
  William
  
  Reply
  - Kai Chen says:
    
    January 5, 2017 at 7:13 pm
    
    Hi William, I wrote the code using Python 3. It looks like a compatibility issue. You can try Python 3.
    
    Reply
Greg says:

January 18, 2017 at 5:49 pm

Do you know how this works with the SEC turning off the FTP services? Thanks so much for the help!

Reply
- Kai Chen says:
  
  March 3, 2017 at 2:58 pm
  
  Code updated.
  
  Reply
Liyan says:

February 11, 2017 at 7:22 pm

Hi Kai, thanks so much for sharing these great codes. I am a python newbie and trying to modify your code to suit the new SEC website post the FTP server shutdown. I having trying to replace ftplib with either httplib2 or urllib2. But it is not exactly working out, specifically I am having difficulties saving it as a file. Would you be kind enough to help out? Thank you!

from __future__ import print_function

import csv
import httplib2

h = httplib2.Http(‘www.sec.gov’)

with open(‘sample.csv’, newline=”) as csvfile:
reader = csv.reader(csvfile, delimiter=’,’)
for line in reader:
saveas = ‘-‘.join([line[0], line[1], line[2], line[3]])
# Reorganize to rename the output filename.
path = line[4].strip()
with open(saveas, ‘wb’) as f:

Reply
- Kai Chen says:
  
  March 3, 2017 at 11:58 am
  
  I’ve updated the code. I use the module requests, but other modules such as httplib2 and urllib should also work.
  
  Reply
Claire says:

March 9, 2017 at 2:26 pm

Thanks for the code! Is there anything wrong 2011 Q4? I get the files for every quarter except this one.

Reply
- Kai Chen says:
  
  March 9, 2017 at 7:42 pm
  
  Looks like the 2011 Q4 included in the Stata dataset. Can you provide more details?
  
  Reply
Eva says:

March 10, 2017 at 7:05 pm

Hi Chen,
I use your python 3 code to download sec 10-K file. However the following error showed up:
—————————————————————————
FileNotFoundError Traceback (most recent call last)
in ()
8 # Reorganize to rename the output filename.
9 url = ‘https://www.sec.gov/Archives/’ + line[4].strip()
—> 10 with open(saveas, ‘wb’) as f:
11 f.write(requests.get(‘%s’ % url).content)
12 print(url, ‘downloaded and wrote to text file’)

FileNotFoundError: [Errno 2] No such file or directory: ‘147-101320-UJB FINANCIAL CORP /NJ/-10-K’

This is after successful download of a couple of txt files. Is it possibly because the url has changed or updated?

Reply
- Kai Chen says:
  
  March 10, 2017 at 9:13 pm
  
  Thanks for letting me know the bug. This is because the company name contains special characters that are not allowed to be used in a text file name. In this code line: saveas = ‘-‘.join([line[0], line[1], line[2], line[3]]), line[0] represents the first element in one line/record of the CSV file (in your case it is 147); line[1] represents the second element (in your case, 101320), and so on. To solve the problem, remove the company name element (in you case, line[2]). I would suggest you change the code line to: saveas = ‘-‘.join([line[0], line[1], line[3], line[4]]).
  
  Reply
João Lago says:

May 10, 2017 at 7:56 pm

Dear Kai,

First, thank you very much for your codes. I have been trying to use your updated script using Python 3.0. However, I get the following error when I run the line “import request”:
“ImportError: No module named request ”
I tried installing the package but i am having some difficulties with it. Would you know how to solve this issue for Python 3.0. ?

In terms of the information that I need to extract, I just need the date of the Filing, the CUSIP of the targeted firm and the Percentage of ownership recorded (if this helps on solving the issue).

Thank you very much for your work.

Best regards,

João Lago

Reply
- Kai Chen says:
  
  May 19, 2017 at 1:05 pm
  
  The Request module documentation or Google would be a better resource for troubleshooting. If you want to extract specific information from text filings, my current posts cannot help you. To achieve that, you should learn something called “regular expression”. Data collection is a huge cost for research. If you only need CUSIP and percentage of ownership (what ownership?), are you sure extracting them from raw text filings is the best way to go?
  
  Reply
Christian Ramos says:

May 11, 2017 at 5:19 pm

Dear Kai,

I had a similar problem to Eva, but I couldn’t fix it myself. Below is a summary of the
code and the error:

import csv
import requests
with open(‘sample.csv’, newline=”) as csvfile:
reader = csv.reader(csvfile, delimiter=’,’)
for line in reader:
saveas = ‘-‘.join([line[0], line[2], line[3]])
url = ‘https://www.sec.gov/Archives/’ + line[4].strip()
with open(saveas, ‘wb’) as f:
f.write(requests.get(‘%s’ % url).content)
print(url, ‘downloaded and wrote to text file’)

5199
https://www.sec.gov/Archives/edgar/data/1000045/0001436857-15-000014.txt downloaded and wrote to text file
Traceback (most recent call last):
File “”, line 6, in
with open(saveas, ‘wb’) as f:
FileNotFoundError: [Errno 2] No such file or directory: ‘1000045-SC 13D/A-2015-04-24’

As you can see, the first filing was extracted, but the next one didn’t. How can I solve this ?

Thank you very much.

Kind regards,

Christian Ramos

Reply
- Kai Chen says:
  
  May 14, 2017 at 10:09 pm
  
  Hi Christian, this is because the forward slash is not allowed in the file name. I have updated my script and made it more error-proof. Please use the new script.
  
  Reply
  - Christian Ramos says:
    
    May 15, 2017 at 5:53 am
    
    Dear Kai,
    
    Thank you for your reply. The script works perfectly fine !
    
    Do you have any advice on how to compile this data into treatable excel file ?
    
    Thank you very much for your work.
    
    Kind regards,
    
    Christian Ramos
    
    Reply
    - Kai Chen says:
      
      May 19, 2017 at 12:48 pm
      
      I am not sure what you want to do exactly. If you want to extract specific texts from those filings, that’s not what my current posts can help you. But to do that, you can turn to something called “regular expression”. Regular expression is independent of programming language. There is a misconception that only PERL can do text extractions. That’s deadly wrong. In my opinion, people will forget PERL once they appreciate the simplicity and readability of Python. Many programming languages, including Python, support regular expression pretty well nowadays. In accounting and finance research, many textual analysis tutorials are using PERL not because PERL is way better, but simply because PERL is older. That being said, writing good regular expression patterns is a work of art. I would suggest borrowing regular expression patterns from those PERL tutorials and then asking Python to do the rest.
      
      Reply
Tiago Ferreira says:

July 14, 2017 at 2:14 pm

Hello,
Thank you so much for the post.

I need to download also the 20f fillings,
Any ideas of how to do that?

Thanks

Reply
- Kai Chen says:
  
  July 20, 2017 at 4:10 pm
  
  Select all 20F from the index file.
  
  Reply
Mostak Ahamed says:

August 2, 2017 at 7:08 pm

Hi Kai,
Thanks for the Stata edgar_idx.dta.
I am not proficient in python. Could I fetch text file from SEC using Stata while using the path idx?

thanks

Reply
- Kai Chen says:
  
  August 8, 2017 at 12:39 pm
  
  Probably not.
  
  Reply
Bei says:

August 11, 2017 at 11:36 pm

Hi Kai,

Thank you so much for sharing the code! While I am fetching the file from SEC, I get the warning message. “HTTPError: HTTP Error 429: Too Many Requests”. Then I tried to browse SEC website, and it gave me the following message.

“You’ve Exceeded the SEC’s Traffic Limit

Your request rate has exceeded the SEC’s maximum allowable requests per second. Your access to SEC.gov will be limited for 10 minutes.

Current guidelines limit each user to a total of no more than 10 requests per second, regardless of the number of machines used to submit requests. To ensure that SEC.gov remains available to all users, we reserve the right to block IP addresses that submit excessive requests.”

Could I know whether you encounter this problem before? Do you have any suggestions to solve it? Thanks!

Reply
- Kai Chen says:
  
  September 18, 2017 at 11:43 am
  
  In this case, you have to pace your program, i.e., adding pause to slow down the program so that it will not run over the limit. You can import the time module and e.g. using time.sleep(25). This instructs the program to pause for 25 seconds.
  
  Reply
Rob says:

September 11, 2017 at 5:12 pm

Kai,

I cannot get the newline operator to work properly. Thus, I incur an IndexError as my indexes are out of range. Could there be something wrong with my csv file?

Reply
- Kai Chen says:
  
  September 18, 2017 at 11:37 am
  
  I don’t know exactly what went wrong on your side. I’m using a Mac computer. If you’re using a Windows machine, Windows and Mac use different newline character. Maybe that’s the reason. Please check Python documentation.
  
  Reply
Jim says:

January 29, 2018 at 9:12 pm

Hi Kai,

Excellent article, thank you for posting it! I have followed your instructions and created my sample.csv with the same info as your example above. When I try to download the files, I get an error for each one. Here is an example:

Start fetching URL to 10-K 10/28/15 filed on edgar/data/320193/0001193125-15-356351.txt …
Error! 2018-01-29 20:11:34 –> 2018-01-29 20:11:36

Could you please assist?

Reply
- Kai Chen says:
  
  February 19, 2018 at 1:28 pm
  
  You used the wrong code—The code you used was for downloading HTML-format 10-K, not for text-format one.
  
  Reply
Philip Bastiansen says:

August 12, 2018 at 10:25 am

Hi Kai, once again – fantastic guide with a very simplistic code that makes it possible for someone like me to follow each of the steps. I have encountered a problem, that has also been raised in a different comment – but the answer doesn’t seem to apply to my context. I get the following:

Traceback (most recent call last):
File “C:/Users/PPB92/PycharmProjects/Projekt/Main.py”, line 10, in
with open(saveas, ‘wb’) as f:
FileNotFoundError: [Errno 2] No such file or directory: ‘1005274-TECHNOLOGY SERVICE GROUP INC \\DE\\-10-K-1996-06-25’

As far as I can see, the problem is that there are back slashes, which should not be allowed. However, I have tried to play around with the “replace” but i can’t seem to replace a \ with a ‘-‘. Is there any way to make it an “and” or an “or” argument, so that it either replaces a / or a \? Any help would be greatly appreciated 🙂

Reply
- Kai Chen says:
  
  August 12, 2018 at 11:03 am
  
  Hi Philip, the error may be caused by the difference between Windows and MacOS. You gave me a good suggestion. I have updated the code and hope it is more error-proof regardless of OS.
  
  Reply
  - Philip Bastiansen says:
    
    August 12, 2018 at 11:15 am
    
    Hi Kai, awesome response time! Guess we all work on sundays, haha.
    
    I’ll look into your code and post a response at some point. Need to understand what the re.sub does, but it seems like you’re replacing all \ or / with a blank space.
    
    Reply
    - Kai Chen says:
      
      August 12, 2018 at 11:18 am
      
      You are right. re.sub is more powerful than str.replace.
      
      Reply
      - Philip Bastiansen says:
        
        August 12, 2018 at 11:22 am
        
        Unfortunately it didn’t seem to work, as it still saves with a forward slash.
        
        ‘5-1000045-NICHOLAS FINANCIAL INC-SC 13G/A’
      - Kai Chen says:
        
        August 12, 2018 at 11:25 am
        
        Please remove the first column (the index column) when you prepare sample.csv
Philip says:

August 26, 2018 at 9:52 am

Hi again Kai!

Thank you for the last advice regarding the MAC –> windows interpretation of the non-allowable characters! I have run into an issue regarding some “timeout error”, and wanted your input:

TimeoutError: [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has fail

Is this a connection error on my end, or the SEC’s? It doesn’t seem like the issue mentioned previously by @Bei , regarding an SEC timeout.

Reply
stu says:

January 29, 2019 at 1:21 am

Hi,
First of all, a million thanks for this site. It is a life saver.
I’m new to python and I’m using pycharm. I ran this code verbatim after part 1:
import csv
import requests
import re

with open(‘sample.csv’, newline=”) as csvfile:
reader = csv.reader(csvfile, delimiter=’,’)
for line in reader:
fn1 = line[0]
fn2 = re.sub(r'[/\\]’, ”, line[1])
fn3 = re.sub(r'[/\\]’, ”, line[2])
fn4 = line[3]
saveas = ‘-‘.join([fn1, fn2, fn3, fn4])
# Reorganize to rename the output filename.
url = ‘https://www.sec.gov/Archives/’ + line[4].strip()
with open(saveas, ‘wb’) as f:
f.write(requests.get(‘%s’ % url).content)
print(url, ‘downloaded and wrote to text file’)

but I received this error:
File “C:/Users/stu/Desktop/untitled4/alt2.py”, line 5, in
with open(‘sample.csv’, newline=”) as csvfile:
FileNotFoundError: [Errno 2] No such file or directory: ‘sample.csv’

Reply
- Yang says:
  
  March 18, 2019 at 11:51 pm
  
  You need to generate your “Sample.csv” file first.
  
  Reply
Vr says:

February 7, 2019 at 3:02 pm

Can you please let me know if the script extracts sections of 10-K filing too? Thank you.

Reply
- Kai Chen says:
  
  February 13, 2019 at 4:24 pm
  
  No
  
  Reply
Naser says:

February 18, 2019 at 2:23 pm

Hi Kai

Thank you for posting the code to download SEC filing. I am new to python. When running part II, I get an error message after the following command:

>>> reader = csv.reader(csvfile, delimiter=’,’)
Traceback (most recent call last):
File “”, line 1, in
NameError: name ‘csvfile’ is not defined

Your help is highly appreciated.

Reply
- Kai Chen says:
  
  March 19, 2019 at 12:13 am
  
  “with open(‘sample.csv’, newline=”) as csvfile” assigns sample.csv to the file object csvfile. You got the error probably because you didn’t place a sample.csv in the working directory.
  
  Reply
Yang says:

March 18, 2019 at 11:50 pm

Traceback (most recent call last):
File “D:\PY4e\edgar2.py”, line 15, in
with open(saveas, ‘wb’) as f:
FileNotFoundError: [Errno 2] No such file or directory: ‘71691-NEW YORK TIMES CO-CORRESP-12/12/2018’
[Finished in 0.369s]

It seems Python cannot remember the file name.

Reply
- Kai Chen says:
  
  March 19, 2019 at 12:07 am
  
  In sample.csv, change the date format from 12/12/2018 to anything without back or forward slash, e.g., 2018-12-12 or 20181212, the code should go through. Slashes will be interpreted as directory structure by Windows and they cannot appear in file name.
  
  Reply
Yang says:

March 19, 2019 at 11:42 pm

After hours test. I finally figure out the issues related to your Python script. Unfortunately, your code will not create multiple files to save the financial statements. Indeed, the code works but it only saves the last file in the sample.csv.
you may add certain sentences to make it work.
for instance, with open(temOutfilename,’r’) as f:
for anothertimeprefix in anotherOneline:
if reporttimeprefix in anotherOneline:
reportTime=anotherOneline.replace(reporttimefix,”).strip()
if companynameprefix in anotherOneline:
companyName=anotherOneline.replace(companyNamepreffix,”).strip()
outputFileName=datapatch+’/’+companyName+’_’reportTime+’.txt’

Reply
- Kai Chen says:
  
  March 20, 2019 at 1:09 am
  
  I did test my code before and it saved each file without issues. If you can only save the last one, the only reason I can imagine for now is that you indented “with open … as f” incorrectly so that it lied outside the loop and only took the last url in.
  
  Reply
Grga says:

June 8, 2019 at 8:03 am

What ever i did it does not work for me. If some one could be please so kind and copy paste the latest code that is working. Many thx in advance. P.S. i am a noob

Reply
- Grga says:
  
  June 8, 2019 at 8:36 am
  
  working now, my bad. However, any idea how to extract income statement from 10-k or q? any tip is appreciated:)
  
  Reply
Jerry says:

June 23, 2019 at 9:47 am

Thanks for the code. I was having problems transforming the data from sqlite to csv, so ended up transforming the data to a dataframe instead.

import pandas as pd
con = sqlite3.connect(‘edgar_idx.db’)
df = pd.read_sql_query(“select * from idx;”, con)
con.close()

Does this produce the same data?

Reply
Karolina says:

July 31, 2019 at 9:05 am

Hi Kai!
I am doing a project in which I have to extract section 1A. Risk Factor from 10-K reports. I followed your instructions, I downloaded 13000 files and I can open them in Word, they are perfectly readable there, but If I open them in a text editor or try to use a Beautiful Soup to extract only txt data from them, it seems like 90% of txt is encoded in XBRL and I have no idea how to decode it into string. I uploaded a sample file here: https://gofile.io/?c=SKrlf6 . Could you please have look? Maybe you know how to decode it? I am lost 🙁

Reply
- Kai Chen says:
  
  July 31, 2019 at 9:38 pm
  
  Thanks for raising the question and letting me know the percentage of xbrl-style txt filings. The landscape of 10-K/Q filings has changed dramatically over the past decade (txt -> html -> html + xbrl -> ixbrl). Today’s txt-format 10-K/Q is totally different from 20 years ago. The parsing method you have seen in literature may not work any more. I would suggest you forget txt format and start with html format with the help of BeautifulSoup.
  
  Reply
Victor says:

August 2, 2019 at 2:30 pm

Kai is right that it is easier using the html format with BeautifulSoup. BeautifulSoup has a one liner that extracts all the text from a html page.

Since you have downloaded the txt file, you can also use BeautifulSoup to extract text from the txt file. The txt file actually includes the html file, all the exhibits, and the xbrl attachments. The html file is contained between

10-K
and the first in the txt file.

That is, the first document in the txt file is the html file, i.e., the main body of the 10-K filing. If you copy
10-K …. to a new txt file in NotePad, save it as txt, and then change the extension to “htm” or “html”, and open it with Chrome or IE. You will find that is exactly the html file.

So you can try something like this. I haven’t tested it, but I have a feeling that it should work.

(1) Open the txt file as a string
(2) Extract the first DOCUMENT, something like .*?., as a string
(3) Convert the DOCUMENT into a SOUP object.
Then, you can extract all the text using the one liner.

Good luck!

Reply
- Victor says:
  
  August 2, 2019 at 6:23 pm
  
  Did not realize that some tags are not shown. I’m going to use “( )” instead of angular brackets, because angular brackets are used to indicate htm tags and are disregarded by the browser.
  
  The body of the 10-K:
  
  (DOCUMENT)
  (TYPE)10-K
  …
  …
  …
  (DOCUMENT)
  
  The following regex matches it: r”(Document).*?(/Document)”
  
  Remember to change () to angular brackets, and set the flag to re.S or re.DOTALL, or put (?s) before the regex. The first match is usually the main body of the 10-K, and is the same as that in the html file.
  
  You can also be more specific with the regex by including the (TYPE)10-K also. Note that 10-K has other variants such as 10-K405. This should work for all:
  
  (Document)\n(TYPE)10.*?(/Document)
  
  Remember to change to angular brackets.
  
  Reply

Use Python to download TXT-format SEC filings on EDGAR (Part II)

59 Responses to Use Python to download TXT-format SEC filings on EDGAR (Part II)

Leave a Reply Cancel reply

Categories

Archives

Site Admin