[Update on 2019-07-31] This post, together with its sibling post “Part II”, has been my most-viewed post since I created this website. However, the landscape of 10-K/Q filings has changed dramatically over the past decade, and the text-format filings are extremely unfriendly for researchers nowadays. I would suggest directing our research efforts to html-format filings with the help of BeautifulSoup. The other post deserves more attention.
[Update on 2018-10-06] As I acknowledged in the very first edition of this post, I borrowed some code from Edouard Swiac’s Python module “python-edgar” (version: 1.0). Edouard kindly informed me that he had updated his module (see his GitHub page). The major updates to his module include: (1) he migrated the file download from FTP to HTTPS and (2) he added parallel downloads, so it is now faster to rebuild the full index, especially if going all the way back to 1993. My initial impression of his updated module is that it provides more flexibility and should be more robust than mine. Thank you Edouard for your work!
[Update on 2017-03-03] SEC closed the FTP server permanently on December 30, 2016 and started to use a more secure transmission protocol, HTTPS. So the description of the FTP server in the original post is no longer applicable (but the basic idea about the URLs to raw text filings remains unchanged). Since then I have received several requests to update the script. Here is the new script for Part I.
# Generate the list of index files archived in EDGAR since start_year (earliest: 1993)
# until the most recent quarter
import datetime

# Please download index files chunk by chunk. For example, please first download index
# files during 1993–2000, then download index files during 2001–2005 by changing the
# following two lines repeatedly, and so on. If you need index files up to the most
# recent year and quarter, comment out the following three lines, remove the comment
# sign at the start of the next three lines, and define the start_year that immediately
# follows the ending year of the previous chunk.
start_year = 2011      # change start_year and end_year to re-define the chunk
current_year = 2015    # change start_year and end_year to re-define the chunk
current_quarter = 4    # do not change this line

# start_year = 2016    # only change this line to download the most recent chunk
# current_year = datetime.date.today().year
# current_quarter = (datetime.date.today().month - 1) // 3 + 1

years = list(range(start_year, current_year))
quarters = ['QTR1', 'QTR2', 'QTR3', 'QTR4']
history = [(y, q) for y in years for q in quarters]
for i in range(1, current_quarter + 1):
    history.append((current_year, 'QTR%d' % i))
urls = ['https://www.sec.gov/Archives/edgar/full-index/%d/%s/master.idx' % (x[0], x[1])
        for x in history]
urls.sort()

# Download index files and write content into SQLite
import sqlite3
import requests

con = sqlite3.connect('edgar_idx.db')
cur = con.cursor()
cur.execute('DROP TABLE IF EXISTS idx')
cur.execute('CREATE TABLE idx (cik TEXT, conm TEXT, type TEXT, date TEXT, path TEXT)')

for url in urls:
    lines = requests.get(url).content.decode("utf-8", "ignore").splitlines()
    records = [tuple(line.split('|')) for line in lines[11:]]
    cur.executemany('INSERT INTO idx VALUES (?, ?, ?, ?, ?)', records)
    print(url, 'downloaded and wrote to SQLite')

con.commit()
con.close()

# Write SQLite database to Stata
import pandas
from sqlalchemy import create_engine

engine = create_engine('sqlite:///edgar_idx.db')
with engine.connect() as conn, conn.begin():
    data = pandas.read_sql_table('idx', conn)
    data.to_stata('edgar_idx.dta', version=117)
The technical details may be too boring for most people. So, I provide multiple downloadable Stata datasets that include all index files from 1993 Q1 to October 6, 2018.
Stata format (1993–2000); Stata format (2001–2005); Stata format (2006–2010); Stata format (2011–2015); Stata format (2016–2019/03/16)
[Original Post] We know that SEC makes company filings (e.g., 10-Ks, 10-Qs and 8-Ks) publicly available on EDGAR. The web search interface is convenient, but we may need to bulk download raw text filings. SEC provides an anonymous EDGAR FTP server to access raw text filings. Usually, if we know the path or URL to a file on an FTP server, we can easily use an Internet browser or FTP software to connect to the server and download the file. For example, if we navigate a bit on the EDGAR FTP server, we can find the path to the file “master.idx” as follows:
ftp://ftp.sec.gov/edgar/full-index/2015/QTR4/master.idx
Copying the path into an Internet browser or FTP software, we can download the file directly.
In the above example, we can find the path to “master.idx” by navigating on the EDGAR FTP server. But we cannot find any path to any raw text filing. In other words, paths to raw text filings are not visible by simply looking into the EDGAR FTP server. SEC purposely hides paths to raw text filings to reduce server load and avoid data abuse.
In order to download SEC filings on EDGAR, we have to:
- Find paths to raw text filings;
- Select what we want and bulk download raw text filings from the EDGAR FTP server using paths we have obtained in the first step.
This post describes the first step, and I elaborate on the second step in another post.
SEC stores all path information in index files. See the technical details here. Let’s take a snapshot of an index file:
Description:           Master Index of EDGAR Dissemination Feed
Last Data Received:    December 31, 2015
Comments:              webmaster@sec.gov
Anonymous FTP:         ftp://ftp.sec.gov/edgar/

CIK|Company Name|Form Type|Date Filed|Filename
--------------------------------------------------------------------------------
1000028|AIM SAFETY CO INC|REVOKED|2015-12-21|edgar/data/1000028/9999999997-15-015914.txt
1000032|BINCH JAMES G|4|2015-12-02|edgar/data/1000032/0001209191-15-082911.txt
1000032|BINCH JAMES G|4|2015-12-11|edgar/data/1000032/0001209191-15-084690.txt
1000045|NICHOLAS FINANCIAL INC|10-Q|2015-11-09|edgar/data/1000045/0001193125-15-371499.txt
1000045|NICHOLAS FINANCIAL INC|8-K|2015-10-29|edgar/data/1000045/0001193125-15-357258.txt
1000045|NICHOLAS FINANCIAL INC|SC 13G/A|2015-11-03|edgar/data/1000045/0001436857-15-000017.txt
1000097|KINGDON CAPITAL MANAGEMENT, L.L.C.|13F-HR|2015-11-16|edgar/data/1000097/0000919574-15-008246.txt
1000097|KINGDON CAPITAL MANAGEMENT, L.L.C.|SC 13G|2015-12-23|edgar/data/1000097/0000919574-15-008800.txt
1000177|NORDIC AMERICAN TANKERS Ltd|6-K|2015-11-12|edgar/data/1000177/0000919574-15-007734.txt
1000177|NORDIC AMERICAN TANKERS Ltd|6-K|2015-11-12|edgar/data/1000177/0000919574-15-007742.txt
1000177|NORDIC AMERICAN TANKERS Ltd|SC 13D/A|2015-11-17|edgar/data/1000177/0000919574-15-008279.txt
...
...
The last field on each line in the main body of the index file shows the path to a real raw text filing. What we have to do in the first step is download and parse all index files and write the content into a database. Then in the second step, we can execute any query against the database (e.g., select a certain form type or a certain period of time) and download raw text filings using the selected paths (a short preview sketch follows the program below).
I write the following Python program to execute the first step. This program borrows some code from Edouard Swiac’s Python module “python-edgar” (version: 1.0). Please see his package information page here.
Please note: my program stores all paths in an SQLite database. I personally like this lightweight database product very much. The last few lines of my program transfer the data from the SQLite database to a Stata dataset for users who are not familiar with SQLite. To do so, I use two Python modules, pandas and sqlalchemy, which you have to install using the pip command on your own. Please google the documentation of SQLite, Pandas, and SQLAlchemy if you have installation problems. I am using Python 3.x in all my Python posts.
import datetime
import ftplib
import sqlite3
import tempfile
import zipfile

# Generate the list of quarterly zip files archived in EDGAR since
# start_year (earliest: 1993) until the most recent quarter
current_year = datetime.date.today().year
current_quarter = (datetime.date.today().month - 1) // 3 + 1
start_year = 1993
years = list(range(start_year, current_year))
quarters = ['QTR1', 'QTR2', 'QTR3', 'QTR4']
history = [(y, q) for y in years for q in quarters]
for i in range(1, current_quarter):
    history.append((current_year, 'QTR%d' % i))
quarterly_files = ['edgar/full-index/%d/%s/master.zip' % (x[0], x[1]) for x in history]
quarterly_files.sort()

# Generate the list of daily index files archived in EDGAR for the most
# recent quarter
ftp = ftplib.FTP('ftp.sec.gov')
ftp.login()
daily_files = [f for f in ftp.nlst('edgar/daily-index')
               if f.startswith('edgar/daily-index/master')]
daily_files.sort()

# Download index files and write content into SQLite
con = sqlite3.connect('edgar_idx.db')
cur = con.cursor()
cur.execute('DROP TABLE IF EXISTS idx')
cur.execute('CREATE TABLE idx (cik TEXT, conm TEXT, type TEXT, date TEXT, '
            'path TEXT)')

for file in quarterly_files:
    with tempfile.TemporaryFile() as temp:
        ftp.retrbinary('RETR %s' % file, temp.write)
        with zipfile.ZipFile(temp).open('master.idx') as z:
            for i in range(10):
                z.readline()
            records = [tuple(line.decode('latin-1').rstrip().split('|'))
                       for line in z]
            cur.executemany('INSERT INTO idx VALUES (?, ?, ?, ?, ?)', records)
    print(file, 'downloaded and wrote to SQLite')

for file in daily_files:
    with tempfile.TemporaryFile() as temp:
        ftp.retrbinary('RETR %s' % file, temp.write)
        temp.seek(0)
        for i in range(7):
            temp.readline()
        records = [tuple(line.decode('latin-1').rstrip().split('|'))
                   for line in temp]
        cur.executemany('INSERT INTO idx VALUES (?, ?, ?, ?, ?)', records)
    print(file, 'downloaded and wrote to SQLite')

con.commit()
con.close()
ftp.close()

# Write SQLite database to Stata
import pandas
from sqlalchemy import create_engine

engine = create_engine('sqlite:///edgar_idx.db')
with engine.connect() as conn, conn.begin():
    data = pandas.read_sql_table('idx', conn)
    data.to_stata('edgar_idx.dta')
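As a quick preview of the second step (which the other post elaborates), here is a minimal sketch, assuming the edgar_idx.db database and idx table created by the program above. The base URL https://www.sec.gov/Archives/ is inferred from the example URLs in this post, and the form type and date range are only illustrations.

import sqlite3
import requests

con = sqlite3.connect('edgar_idx.db')
cur = con.cursor()
# Illustrative query: paths to all 10-Q filings dated November 2015
cur.execute("SELECT path FROM idx WHERE type = '10-Q' "
            "AND date BETWEEN '2015-11-01' AND '2015-11-30'")
for (path,) in cur.fetchall():
    url = 'https://www.sec.gov/Archives/' + path  # assumed base for the relative paths
    filing = requests.get(url).text               # raw text filing
    # ... parse or save the filing here ...
con.close()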
I found two articles that explain how to use R and Perl to achieve the same functionality. I include the links (R or Perl) for users who are more comfortable with those languages.
Hi Kai, I have found your page while checking for ways to download SEC EDGAR files. I am very new to this topic and was wondering if you could help me with some questions I have regarding the use of Python for such a massive download? I would be happy to send you an email, but please let me know if you prefer me to ask you as a comment related to your post instead. Thanks!
Thanks for the data, but I have a hard time extracting this file (around 395.95MB). The unzipped file size seems to be about 1.7GB, rather than 19GB. And I also get an error when the extraction process is close to 100%. Any thoughts?
P.S. I use 7-z to extract it.
Hi Cliff, the unzipped SAS data file is 19G on my laptop. I put a link to the mega file so that you can download it directly: https://www.dropbox.com/s/chevzd830h4vtdz/edgaridx.sas7bdat?dl=0
As to the unzipping stopping at nearly 100%, maybe you can use other unzip programs such as WinRAR for better compatibility.
Dear Kai:
I don’t know how to handle this size of data. I am a pythoner now? Any suggestion?
Now I use a Python SAS-reading library to solve it. Anyway, thanks.
Hi Kai,
Thank you! I just saw the reply. (I thought I would receive an email from you when I got a reply:p)
I found another website that also includes the EDGAR master file (http://www.wrds.us/index.php/repository/view/25)
When I compare that one with yours, basically the two are quite similar, with almost the same number of observations. One thing I notice is that yours has an ID, and the length of your variables is 255. Therefore their file is 1.9 GB while yours is 19 GB. Good job anyway!
Hi Kai,
Thanks for sharing this post. I’m facing an error when I run it with Python 2.7 (using Windows) :
AttributeError: ‘module’ object has no attribute ‘FTP_ADDR’
I tried importing edgar.FTP_ADDR but i get an import error:
ImportError: No module named FTP
Any idea how to resolve it.
Best,
Kal
Did you install the package “python-edgar” properly?
Yes, the installation was successful. When I re-install it I get:
Requirement already satisfied (use --upgrade to upgrade): python-edgar in c:\python27\lib\site-packages
This is great, but when i run the code I get “sqlite3.ProgrammingError: You must not use 8-bit bytestrings unless you use a text_factory that can interpret 8-bit bytestrings (like text_factory = str). It is highly recommended that you instead just switch your application to Unicode strings.”
Is there some way to convert this to utf-8?
Thanks,
are 13F filings included in this file? Thanks!
Yes
Wow, this was really helpful. It took me a while to figure out
[Errno 2] No such file or directory
I had to make sure that the directory existed.
http://stackoverflow.com/questions/2401628/open-file-in-w-mode-ioerror-errno-2-no-such-file-or-directory
Excellent work. We are a boutique Investment Bank based in Chicago and Dallas. Specializing in marketing private offerings publicly via the JOBS Act/SEC guidelines. http://www.cassonmediagroup.com
One of the tools we are working on is providing free, user-friendly access to all Private Offering filed (Form D) plus all offerings that have been posted on the “Equity Crowdfunding” sites (that have not yet filed): a manual aggregation project.
Our thinking at this point is to not only extract via FTP, etc. all the new Form D filings every day plus attached docs. But to build a “web widget” app that can be embedded on any website (similar to stock quote widgets) giving the user a fast and painless way to access/search private offerings.
Also, we have a mobile app (iPad, iPhone, IOS, Android) that would access the data.
At no point do we plan to monetize access; all would be Open Source and free. We would only ask for a name and email to use the app.
We would deeply appreciate any assistance that you could provide. Paid consulting gig, a 30-minute phone conference, design review, anything!
I look forward to hearing from you.
Thanks in advance,
John White
847-867-5911
Hi Kai Chen,
Great post. I was wondering where I can find Part II of this post, as I want only 10-K and 10-Q forms. Secondly, I’m able to create the 'idx' table, but somehow the values are not getting inserted into it; I always end up with an error.
Thanks in advance,
Tayyaba
Hi Tayyaba, I have not started Part II but will do it soon. If you could send me all the log information, I may be able to find out why the values are not being inserted.
Hi Kai, Thanks for the reply. I have a MySQL DB and I think the format of “records” isn’t the one MySQL expects. So, I downloaded your .dta file and converted it into CSV format, and will import the csv files into the table. But there’s one issue I’m having right now: the url or path we have in the table is in txt format, whereas I want it in .htm format. I’m working on it.
This question on stackoverflow is the one which addresses my problem:
http://stackoverflow.com/questions/33979261/html-rendering-of-edgar-txt-filings
Hi Tayyaba,
Part II is out: http://kaichen.work/?p=681
I read your post on Stackoverflow. In my code I retrieve “master.idx” because the goal is to download and process raw text filings for further textual analysis. To achieve your goal, you can try to retrieve another type of index file: “crawler.idx”. See the example ftp://ftp.sec.gov/edgar/daily-index/crawler.20160201.idx. That path leads to the html version. You still need some tweaking (going one level deeper) but that should be doable.
Hope this helps.
Kai
Hey Kai, Thank you so much, really appreciate it.
hi Kai,
Great job. May I know why there is a loop i from 0 to 9 in the following code?
with zipfile.ZipFile(temp).open('master.idx') as z:
for i in range(10):
z.readline()
Thank you!
Hi Eva, if you look into the index file ‘master.idx’, you’ll notice the first ten lines are titles, column names, etc. The loop is used to skip the first ten lines and reach the “true” records.
Thank you very much!
Hello Kai,
Thanks so much for your posting. It helps me a lot. I was thinking about doing the same thing but do not have enough skills to write a Python script like this. I appreciate your effort.
I emulated the script on Python 3.5.2 and was able to create a sqlite db file. But I get errors when I try to export it as a dta file. So I just downloaded the dta file as well, but I want to know what is wrong. Below is the error message.
—————————————————————————
OperationalError Traceback (most recent call last)
C:\Anaconda3\lib\site-packages\sqlalchemy\engine\result.py in fetchall(self)
963 try:
–> 964 l = self.process_rows(self._fetchall_impl())
965 self._soft_close()
C:\Anaconda3\lib\site-packages\sqlalchemy\engine\result.py in _fetchall_impl(self)
914 try:
–> 915 return self.cursor.fetchall()
916 except AttributeError:
OperationalError:
The above exception was the direct cause of the following exception:
OperationalError Traceback (most recent call last)
in ()
5 engine = create_engine(‘sqlite:///edgar_idx.db’)
6 with engine.connect() as conn, conn.begin():
—-> 7 data = pandas.read_sql_table(‘idx’, conn)
8 data.to_stata(‘edgar_idx.dta’)
C:\Anaconda3\lib\site-packages\pandas\io\sql.py in read_sql_table(table_name, con, schema, index_col, coerce_float, parse_dates, columns, chunksize)
362 table = pandas_sql.read_table(
363 table_name, index_col=index_col, coerce_float=coerce_float,
–> 364 parse_dates=parse_dates, columns=columns, chunksize=chunksize)
365
366 if table is not None:
C:\Anaconda3\lib\site-packages\pandas\io\sql.py in read_table(self, table_name, index_col, coerce_float, parse_dates, columns, schema, chunksize)
1128 return table.read(coerce_float=coerce_float,
1129 parse_dates=parse_dates, columns=columns,
-> 1130 chunksize=chunksize)
1131
1132 @staticmethod
C:\Anaconda3\lib\site-packages\pandas\io\sql.py in read(self, coerce_float, parse_dates, columns, chunksize)
809 parse_dates=parse_dates)
810 else:
–> 811 data = result.fetchall()
812 self.frame = DataFrame.from_records(
813 data, columns=column_names, coerce_float=coerce_float)
C:\Anaconda3\lib\site-packages\sqlalchemy\engine\result.py in fetchall(self)
968 self.connection._handle_dbapi_exception(
969 e, None, None,
–> 970 self.cursor, self.context)
971
972 def fetchmany(self, size=None):
C:\Anaconda3\lib\site-packages\sqlalchemy\engine\base.py in _handle_dbapi_exception(self, e, statement, parameters, cursor, context)
1339 util.raise_from_cause(
1340 sqlalchemy_exception,
-> 1341 exc_info
1342 )
1343 else:
C:\Anaconda3\lib\site-packages\sqlalchemy\util\compat.py in raise_from_cause(exception, exc_info)
200 exc_type, exc_value, exc_tb = exc_info
201 cause = exc_value if exc_value is not exception else None
–> 202 reraise(type(exception), exception, tb=exc_tb, cause=cause)
203
204 if py3k:
C:\Anaconda3\lib\site-packages\sqlalchemy\util\compat.py in reraise(tp, value, tb, cause)
183 value.__cause__ = cause
184 if value.__traceback__ is not tb:
–> 185 raise value.with_traceback(tb)
186 raise value
187
C:\Anaconda3\lib\site-packages\sqlalchemy\engine\result.py in fetchall(self)
962
963 try:
–> 964 l = self.process_rows(self._fetchall_impl())
965 self._soft_close()
966 return l
C:\Anaconda3\lib\site-packages\sqlalchemy\engine\result.py in _fetchall_impl(self)
913 def _fetchall_impl(self):
914 try:
–> 915 return self.cursor.fetchall()
916 except AttributeError:
917 return self._non_result([])
OperationalError: (sqlite3.OperationalError)
Hi Kwan,
I don’t know the exact cause based on the error log, but I guess it may be related to relative versus absolute paths. Try this: first, check the file size of the “edgar_idx.db” you got. If it is a sizeable file, the problem should lie in the conversion from SQLite to Stata; second, use the absolute path in the conversion step: engine = create_engine(r'sqlite:///C:\path\to\edgar_idx.db'). Replace 'C:\path\to\' with your own absolute path.
Let me know if this works.
Kai
Hello Kai,
I spent some time reverse engineering and revising your script to suit my need. And your suggestion worked!! Thanks so much!!
I was so excited when Python was running, but unfortunately, at the end, it gave an OperationalError. Would it be due to the file size, the first possibility you mentioned above? Could you explain how to deal with big db files?
I didn’t try my codes on a Windows machine. The first step is to make sure you have generated a Sqlite database correctly. If the database has gathered the data as it is supposed to have, the size of the database should be about 2G. The first step helps you to decide at which stage the error occurs. If you can get the correct database file but just cannot export the data to another software, you can instead Google other exporting methods (there are tons of other methods).
Thanks for your reply, Kai. I have encountered errors a few times before when running scripts written on a Mac. It is more than a backslash or forward slash issue, but I could not pinpoint the cause. I will figure it out and get back to you when I solve the issue. Thanks!
Thank you Kai for sharing this tutorial – after tweaking your code to better suit my environment it worked on the first try! I do, however, see something odd about the data returned. Although my sqlite db is about 2gb as you said it should be (it’s actually 1.7gb with another table in it), I find that I was only able to index ~16.4M filings when I ran the script today.
However, this article written back in 2014 (http://tinyurl.com/jupr3zy), indicates that there should’ve been at least 25M filings. I’ve seen various other articles that note there should be at least 20M.
Do you have any idea where the other ~5M to ~10M indexes are?
Thanks again for sharing!
You’re welcome. I have no idea about the other 5–10M indexes, but I believe the code is able to grab all indexes currently populated on SEC’s ftp server.
Here’s a reference to the SEC saying they offer “Free access to more than 21 million filings” (https://www.sec.gov/edgar/searchedgar/companysearch.html). Perhaps the 5–10M missing files are .htm, .html, or .xml children of some of the ~14M .txt files indexed.
Thanks for letting me know.
For people using this: the FTP server will shut down in December 2016. It is published on their site. https://www.sec.gov/edgar/searchedgar/ftpusers.htm
Does this return all the 10-Ks for every company in the Russell 3000?
I would like to know what it extracts. I have done database programming using SQLite to extract tweets into a database. I would like to know what this does specifically.
So I get the name of the company, ticker and path. Then how can I use the path to iterate over the text file?
Unfortunately, EDGAR FTP access will be shut down by the end of the year 🙁
https://www.sec.gov/edgar/searchedgar/ftpusers.htm
Regarding the SEC shutting down the FTP server on 12/31:
Replacing the FTP bits in the script with urllib2 and using https://www.sec.gov/Archives/edgar/full-index etc. as the base will return exactly the same data, and the rest of the script should work accordingly.
HDT, can you elaborate more? I’m in the process of refactoring for urllib2 and am having difficulties with method statements.
Code updated
Hi HDT,
I would also appreciate it if you could reply. Specifically if you could expand on what exactly entails replacing the FTP bits in the script with urllib2.
Thank you!
Code updated
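For readers who asked what this substitution involves: below is a minimal sketch of the idea HDT describes, written with urllib.request (the Python 3 counterpart of urllib2) and the index URL pattern used in the updated script at the top of this post. The number of header lines skipped follows that script; treat this as an illustration rather than a drop-in replacement.

import sqlite3
import urllib.request  # urllib2 in Python 2

url = 'https://www.sec.gov/Archives/edgar/full-index/2015/QTR4/master.idx'
lines = urllib.request.urlopen(url).read().decode('latin-1').splitlines()
records = [tuple(line.split('|')) for line in lines[11:]]  # skip the header block

con = sqlite3.connect('edgar_idx.db')
cur = con.cursor()
cur.execute('CREATE TABLE IF NOT EXISTS idx '
            '(cik TEXT, conm TEXT, type TEXT, date TEXT, path TEXT)')
cur.executemany('INSERT INTO idx VALUES (?, ?, ?, ?, ?)', records)
con.commit()
con.close()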
Thanks for your code! I am using your code to successfully download proxy statements in txt format. Since the htm files are better formatted, I am trying to revise your code to download crawler.idx. But I am stuck at this line:
cur.executemany('INSERT INTO idx VALUES (?, ?, ?, ?, ?)', records),
as I don’t know how to set line.split for crawler.idx. Can you help to revise the code? Thanks in advance!
Please substitute the “records = ” line with the following several lines:
nameloc = lines[7].find('Company Name')
typeloc = lines[7].find('Form Type')
cikloc = lines[7].find('CIK')
dateloc = lines[7].find('Date Filed')
urlloc = lines[7].find('URL')
records = [tuple([line[:typeloc].strip(), line[typeloc:cikloc].strip(), line[cikloc:dateloc].strip(), line[dateloc:urlloc].strip(), line[urlloc:].strip()]) for line in lines[9:]]
But note the URL you obtain is not the URL to an html version Form (see one example of such url https://www.sec.gov/Archives/edgar/data/859747/0001477932-16-007969-index.htm). You still have to go one level deeper, which I don’t know how to do yet.
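One possible way to go that one level deeper is sketched below, under the assumption that the -index.htm page (such as the example above) lists its documents as links under /Archives/edgar/data/. It parses the page with requests and BeautifulSoup (the approach the 2019 update above also recommends); inspect a real index page and adjust the filter if the layout differs.

import requests
from bs4 import BeautifulSoup

# Example -index.htm page from the reply above
index_url = 'https://www.sec.gov/Archives/edgar/data/859747/0001477932-16-007969-index.htm'
soup = BeautifulSoup(requests.get(index_url).text, 'html.parser')

doc_urls = []
for a in soup.find_all('a', href=True):
    href = a['href']
    # Keep only links to the filing's own documents (htm/txt under /Archives/)
    if href.startswith('/Archives/edgar/data/') and href.endswith(('.htm', '.txt')):
        doc_urls.append('https://www.sec.gov' + href)

print(doc_urls)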
Hi Kai
That is really useful code that can be reused in many scenarios. I am trying to create a database which contains detailed information from 13F-HR filings. Now I did it by getting all the raw txt/xml data based on your code 🙂
Now something that frustrates me (a little bit) is that it looks like there are only raw txt files for 13F until 2013 Q3, when xml became available.
Although it is quite easy to parse xml with Python’s ElementTree, I am still struggling with how to deal with the txt data before 2013 Q3.
Any constructive thoughts ?
Thank you very much
By the way, i am also in Ontario CA 🙂
My current knowledge about text data processing is also limited. To start with, you can learn regular expressions. That could resolve many questions. Higher-level applications require knowledge of natural language processing using Python (or other programming languages). It’s technically intimidating.
Hi Ken, thanks for your post. I am just wondering whether you know if it is possible to automate an online download with Stata. This is the only program that I am comfortable using but never wrote such a code before.
I don’t know the exact answer, but because Stata is not a general-purpose programming language, it will be awkward to use Stata to do this.
Hi Kai, it is very useful code. But I encountered the errors below when executing your exact code on both of my computers. Do you know the reason and how I can solve it? I need to analyze the 10-Ks to finish my homework. I appreciate your thoughts. Thank you very much!
—————————————————————————
MemoryError Traceback (most recent call last)
in ()
38 with engine.connect() as conn, conn.begin():
39 data = pandas.read_sql_table(‘idx’, conn)
—> 40 data.to_stata(‘edgar_idx.dta’)
~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\frame.py in to_stata(self, fname, convert_dates, write_index, encoding, byteorder, time_stamp, data_label, variable_labels)
1487 write_index=write_index,
1488 variable_labels=variable_labels)
-> 1489 writer.write_file()
1490
1491 def to_feather(self, fname):
~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\io\stata.py in write_file(self)
2137 # write 5 zeros for expansion fields
2138 self._write(_pad_bytes(“”, 5))
-> 2139 self._prepare_data()
2140 self._write_data()
2141 self._write_value_labels()
~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\io\stata.py in _prepare_data(self)
2272
2273 if has_strings:
-> 2274 self.data = np.fromiter(zip(*data_cols), dtype=dtype)
2275 else:
2276 self.data = data.to_records(index=False)
MemoryError: cannot allocate array memory
Hi Jing, I am not sure where the problem lies. Maybe you didn’t install the required modules correctly, or maybe your machines didn’t have enough free RAM. Try changing start_year to a very recent year, say 2017, and see if the code can go through. If yes, then RAM is probably the culprit.
Hi Chen,
Thanks so much for your tutorial and sample code. I really appreciate it. Your code works perfectly fine except that Python (v3.5, using PyCharm) hangs when trying to download the 2011QTR4 and 2017QTR3 master.idx. I managed to download all the other master.idx files by changing the year/quarter variables. I have tried on another computer and the same happened. Any idea what might be the problem? I can work with the Stata file you uploaded for now, but I was thinking of extending the data in the future.
Thanks.
Hi Chang Jie. It just takes longer (e.g., around 11 minutes to download 2011QTR4 on my laptop) but will be able to go through eventually. I don’t know why though.
It is not the download that takes time. For some reason the requests library is having difficulty auto-detecting the encoding of these 2 files. It detects 'ISO-8859-1' instead of the standard 'ascii'.
I fixed it as such for the moment:
rep = requests.get(url)
rep.encoding = 'ISO-8859-1'
lines = rep.text.splitlines()
Hi Kai,
Is there a way for this also to include the time the report was submitted in addition to the date?
Dear Kai Chen!
First of all – THANK you so much for doing this, you truly are amazing! I have been trying to run your code, but it doesn’t work due to a memory error (because of my lack of RAM… I have 8GB). So, if I change the start date to 2015 it works flawlessly! However, I do need the data back to 1993. But I was wondering whether there is a way to limit the scope of this code to only the index entries for 10-K form types?
I’d love to hear back from you, since I am currently working on my master’s thesis!
Best regards,
Philip, you can download the data chunk by chunk: for example, first download data over 1993–1996, then download data over 1997–2000, and so on. All you need to do is change the first three lines: set current_quarter equal to 4; for data over 1993–1996, set start_year equal to 1993 and current_year equal to 1996; then for data over 1997–2000, set start_year equal to 1997 and current_year equal to 2000, and so on; lastly, combine all data chunks.
Thank you so much, that solved it for me! 🙂
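To make the chunking concrete, here is a minimal sketch that wraps the URL-building part of the script above in a function and loops over chunks; the variable names follow the script, and the chunk boundaries are only examples.

def build_urls(start_year, current_year, current_quarter=4):
    years = list(range(start_year, current_year))
    quarters = ['QTR1', 'QTR2', 'QTR3', 'QTR4']
    history = [(y, q) for y in years for q in quarters]
    for i in range(1, current_quarter + 1):
        history.append((current_year, 'QTR%d' % i))
    return ['https://www.sec.gov/Archives/edgar/full-index/%d/%s/master.idx' % (y, q)
            for (y, q) in sorted(history)]

chunks = [(1993, 1996), (1997, 2000), (2001, 2005)]  # and so on
for start_year, current_year in chunks:
    for url in build_urls(start_year, current_year):
        pass  # download and insert into SQLite as in the script above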
After checking out a number of the blog articles on your website, I really like your way of blogging.
I saved it to my bookmark website list and will be checking
back soon. Take a look at my website as well and tell me how you feel.
Many thanks Kai Chen,
great code for accessing the EDGAR database.
I only have a problem with Part I of your code, as it throws an error in pandas before writing to the Stata .dta file – can you help me with that? Many thanks!
~ line 42, in
data = pandas.read_sql_table(‘idx’, conn)
~pandas\io\sql.py”, line 247, in read_sql_table
parse_dates=parse_dates, columns=columns, chunksize=chunksize)
~pandas\io\sql.py”, line 1003, in read_table
chunksize=chunksize)
~pandas\io\sql.py”, line 682, in read
data = result.fetchall()
~sqlalchemy\engine\result.py”, line 1137, in fetchall
self.cursor, self.context)
~sqlalchemy\engine\base.py”, line 1416, in _handle_dbapi_exception
util.reraise(*exc_info)
~sqlalchemy\util\compat.py”, line 249, in reraise
raise value
~sqlalchemy\engine\result.py”, line 1131, in fetchall
l = self.process_rows(self._fetchall_impl())
~sqlalchemy\engine\result.py”, line 1111, in process_rows
for row in rows]
~sqlalchemy\engine\result.py”, line 1111, in
for row in rows]
MemoryError
My best guess is that the latest version of pandas changed the way it handles memory and, as a result, requires more memory to complete the conversion. I created the Stata dataset without any issue in March 2017, but now I can only create the Stata dataset successfully in smaller pieces. In other words, I had to create multiple Stata datasets: the first for 1993–2000, the second for 2001–2005, the third for 2006–2010, and so on. I’m curious how much memory pandas now requires to convert that single SQLite database to Stata in just one step. My iMac with 24G of memory failed.
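A minimal sketch of that piecewise export, assuming the idx table built above and that the date column holds 'YYYY-MM-DD' strings as in the index files; the year ranges and file names are only illustrations.

import pandas
from sqlalchemy import create_engine

engine = create_engine('sqlite:///edgar_idx.db')
chunks = [(1993, 2000), (2001, 2005), (2006, 2010), (2011, 2015), (2016, 2019)]

with engine.connect() as conn:
    for first, last in chunks:
        # Select only the rows whose filing year falls in this chunk
        query = ('SELECT * FROM idx WHERE CAST(substr(date, 1, 4) AS INTEGER) '
                 'BETWEEN %d AND %d' % (first, last))
        data = pandas.read_sql_query(query, conn)
        data.to_stata('edgar_idx_%d_%d.dta' % (first, last), version=117)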
Hi Kai Chen,
many thanks for your helpful post. Unfortunately your first code does not run through on my machine.
It would be great if you could upload a new version of the Stata dataset for download.
Many thanks!
Hi Kai Chen,
many thanks for your great post!
I was wondering whether the “Series ID”, which has to be published by Investment Companies, can also be downloaded via the Index file? Do you have any idea for the matching?
https://www.sec.gov/open/datasets-investment_company.html
Best regards and many thanks,
Dom
Hi Kai,
Thanks for sharing this. But the download links all seem to have expired. Can you check?
Best,
Caleb
Fixed.
Kai,
Thanks for posting this.
I’m about to embark on a mission to parse out XBRL files.
Could you please tell me if there’s an equivalent list for XBRL, like the master.idx you have pointed out above?
Incidentally, I’m also Kai.
Regards,
Kai
The appendix of this working paper, https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3376040, titled “The Relative Influences of Officers and Auditors on Annual Report Textual Disclosures”, provides a detailed discussion about how to extract footnotes from xbrl filings. As mentioned in the paper, the following link provides a sort of index for xbrl filings: https://www.sec.gov/dera/data/financial-statement-data-sets.html
Dear Kai,
Thank you very much for your blog, super helpful! Can you please clarify the following issue:
I am running the first part of the code that comes after [Update on 2017-03-03]. Once I run it, I come across the following error:
cur.executemany('INSERT INTO idx VALUES (?, ?, ?, ?, ?)', records)
“ProgrammingError: Incorrect number of bindings supplied. The current statement uses 5, and there are 1 supplied.”
Do you know what is wrong here by any chance? Thank you for your help!
UnicodeEncodeError Traceback (most recent call last)
~\AppData\Local\Temp\ipykernel_2524\865941535.py in
1 with engine.connect() as conn, conn.begin():
2 data = pandas.read_sql_table(‘idx’, conn)
—-> 3 data.to_stata(‘edgar_idx.dta’, version=117)
~\anaconda3\lib\site-packages\pandas\util\_decorators.py in wrapper(*args, **kwargs)
205 else:
206 kwargs[new_arg_name] = new_arg_value
–> 207 return func(*args, **kwargs)
208
209 return cast(F, wrapper)
~\anaconda3\lib\site-packages\pandas\core\frame.py in to_stata(self, path, convert_dates, write_index, byteorder, time_stamp, data_label, variable_labels, version, convert_strl, compression, storage_options, value_labels)
2636 kwargs[“version”] = version
2637
-> 2638 writer = statawriter(
2639 path,
2640 self,
~\anaconda3\lib\site-packages\pandas\io\stata.py in __init__(self, fname, data, convert_dates, write_index, byteorder, time_stamp, data_label, variable_labels, convert_strl, compression, storage_options, value_labels)
3200 self._convert_strl.extend(convert_strl)
3201
-> 3202 super().__init__(
3203 fname,
3204 data,
~\anaconda3\lib\site-packages\pandas\io\stata.py in __init__(self, fname, data, convert_dates, write_index, byteorder, time_stamp, data_label, variable_labels, compression, storage_options, value_labels)
2280 self._converted_names: dict[Hashable, str] = {}
2281 # attach nobs, nvars, data, varlist, typlist
-> 2282 self._prepare_pandas(data)
2283 self.storage_options = storage_options
2284
~\anaconda3\lib\site-packages\pandas\io\stata.py in _prepare_pandas(self, data)
2559
2560 # Verify object arrays are strings and encode to bytes
-> 2561 self._encode_strings()
2562
2563 self._set_formats_and_types(dtypes)
~\anaconda3\lib\site-packages\pandas\io\stata.py in _encode_strings(self)
2598 supported types.”””
2599 )
-> 2600 encoded = self.data[col].str.encode(self._encoding)
2601 # If larger than _max_string_length do nothing
2602 if (
~\anaconda3\lib\site-packages\pandas\core\strings\accessor.py in wrapper(self, *args, **kwargs)
123 )
124 raise TypeError(msg)
–> 125 return func(self, *args, **kwargs)
126
127 wrapper.__name__ = func_name
~\anaconda3\lib\site-packages\pandas\core\strings\accessor.py in encode(self, encoding, errors)
1871 encoded : Series/Index of objects
1872 “””
-> 1873 result = self._data.array._str_encode(encoding, errors)
1874 return self._wrap_result(result, returns_string=False)
1875
~\anaconda3\lib\site-packages\pandas\core\strings\object_array.py in _str_encode(self, encoding, errors)
221 def _str_encode(self, encoding, errors=”strict”):
222 f = lambda x: x.encode(encoding, errors=errors)
–> 223 return self._str_map(f, dtype=object)
224
225 def _str_find(self, sub, start=0, end=None):
~\anaconda3\lib\site-packages\pandas\core\strings\object_array.py in _str_map(self, f, na_value, dtype, convert)
69 map_convert = convert and not np.all(mask)
70 try:
—> 71 result = lib.map_infer_mask(arr, f, mask.view(np.uint8), map_convert)
72 except (TypeError, AttributeError) as err:
73 # Reraise the exception if callable `f` got wrong number of args
~\anaconda3\lib\site-packages\pandas\_libs\lib.pyx in pandas._libs.lib.map_infer_mask()
~\anaconda3\lib\site-packages\pandas\core\strings\object_array.py in (x)
220
221 def _str_encode(self, encoding, errors=”strict”):
–> 222 f = lambda x: x.encode(encoding, errors=errors)
223 return self._str_map(f, dtype=object)
224
UnicodeEncodeError: ‘latin-1’ codec can’t encode character ‘\u2019’ in position 43: ordinal not in range(256)
I am getting this error; can you help me figure out what the issue is? I am new to Python and I am using Windows.