Use Python to download TXT-format SEC filings on EDGAR (Part I)

We know that the SEC makes company filings (e.g., 10-Ks, 10-Qs, and 8-Ks) publicly available on EDGAR. The web search interface is convenient, but sometimes we need to bulk download raw text filings. The SEC provides an anonymous EDGAR FTP server for accessing raw text filings (Update on March 3, 2017: the SEC closed the FTP server permanently on December 30, 2016, so the description of the FTP server may no longer be applicable. But the basic idea about the URLs to raw text filings remains unchanged). Usually, if we know the path or URL to a file on an FTP server, we can easily use a web browser or an FTP client to connect to the server and download the file. For example, if we navigate a bit on the EDGAR FTP server, we can find the path to the file “master.idx” as follows:

ftp://ftp.sec.gov/edgar/full-index/2015/QTR4/master.idx

If we copy the path into a web browser or an FTP client, we can download the file directly.
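Since the FTP server was later retired (see the update above), the same files now live under EDGAR’s HTTPS Archives. A minimal sketch of the path mapping, assuming the current sec.gov layout (the helper function is my own illustration, not part of the original post):

```python
def edgar_https_url(ftp_path):
    """Map an old EDGAR FTP path to its HTTPS equivalent (an assumption
    based on the current sec.gov Archives layout)."""
    prefix = 'ftp://ftp.sec.gov/'
    assert ftp_path.startswith(prefix)
    return 'https://www.sec.gov/Archives/' + ftp_path[len(prefix):]

print(edgar_https_url('ftp://ftp.sec.gov/edgar/full-index/2015/QTR4/master.idx'))
# https://www.sec.gov/Archives/edgar/full-index/2015/QTR4/master.idx
```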

In the above example, we can find the path to “master.idx” by navigating the EDGAR FTP server. But we cannot find the path to any raw text filing this way. In other words, paths to raw text filings are not visible by simply browsing the EDGAR FTP server. The SEC purposely hides these paths to reduce server load and avoid data abuse.

In order to download SEC filings on EDGAR, we have to:

1. Find paths to raw text filings;

2. Select what we want and bulk download the raw text filings from the EDGAR FTP server using the paths obtained in the first step.

This post describes the first step, and I elaborate the second step in another post.

SEC stores all path information in index files. See technical details here. Let’s take a snapshot of an index file:

The last field on each line in the main body of the index file is the path to an actual raw text filing. What we have to do in the first step is download and parse all index files and write their contents into a database. Then in the second step, we can run any query against the database (e.g., select a certain form type or a certain period of time) and download the raw text filings using the selected paths.
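Concretely, each line in the body of a master index file is pipe-delimited (CIK|Company Name|Form Type|Date Filed|Filename). A minimal sketch of the parse-and-query round trip; the sample lines, accession numbers, and table layout here are my own illustration, not necessarily the exact schema of the program below:

```python
import sqlite3

# Illustrative master.idx body lines (pipe-delimited); sample values only
sample = [
    '320193|APPLE INC|10-K|2015-10-28|edgar/data/320193/0001193125-15-356351.txt',
    '320193|APPLE INC|8-K|2015-10-27|edgar/data/320193/0001193125-15-355964.txt',
]

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE idx (cik TEXT, conm TEXT, type TEXT, date TEXT, path TEXT)')

# Keep only well-formed five-field records and bulk-insert them
records = [tuple(line.split('|')) for line in sample if line.count('|') == 4]
conn.executemany('INSERT INTO idx VALUES (?, ?, ?, ?, ?)', records)

# Step two is then a simple query, e.g. all 10-K paths
rows = conn.execute("SELECT path FROM idx WHERE type = '10-K'").fetchall()
print(rows)
```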

I wrote the following Python program to carry out the first step. The program borrows from Edouard Swiac’s Python module “python-edgar” (version 1.0). Please see his package information page here.

Please note: my program stores all paths in an SQLite database. I personally like this lightweight database product very much. The last few lines of my program transfer the data from the SQLite database to a Stata dataset for users who are not familiar with SQLite. To do so, I use two Python modules, pandas and SQLAlchemy, which you have to install on your own using pip. Please consult the documentation of SQLite, pandas, and SQLAlchemy if you run into installation problems. I am using Python 3.x in all my Python posts.


import datetime
import ftplib
import sqlite3
import tempfile
import zipfile

# Generate the list of quarterly zip files archived in EDGAR since
# start_year (earliest: 1993) until the most recent quarter
current_year = datetime.date.today().year
current_quarter = (datetime.date.today().month - 1) // 3 + 1
start_year = 1993
years = list(range(start_year, current_year))
quarters = ['QTR1', 'QTR2', 'QTR3', 'QTR4']
history = [(y, q) for y in years for q in quarters]
for i in range(1, current_quarter):
    history.append((current_year, 'QTR%d' % i))
quarterly_files = ['edgar/full-index/%d/%s/master.zip' % (x[0], x[1]) for x in history]
quarterly_files.sort()

# Generate the list of daily index files archived in EDGAR for the most
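The rest of the script is cut off in this copy. The general shape of the per-quarter processing (download master.zip, pull master.idx out of it, keep only the data lines) can be sketched as follows, using an in-memory zip so the example is self-contained; the index content and filename below are illustrative, not real filings:

```python
import io
import zipfile

# Stand-in for one downloaded master.zip: each archive holds a single
# 'master.idx'. The content below is illustrative sample data.
idx_text = ('CIK|Company Name|Form Type|Date Filed|Filename\n'
            '----------------------------------------------\n'
            '320193|APPLE INC|10-K|2015-10-28|edgar/data/320193/example.txt\n')

buf = io.BytesIO()
with zipfile.ZipFile(buf, 'w') as zf:
    zf.writestr('master.idx', idx_text)

# Read master.idx back out of the zip and keep only the data lines
with zipfile.ZipFile(buf) as zf:
    lines = zf.read('master.idx').decode('latin-1').splitlines()
records = [tuple(l.split('|'))
           for l in lines
           if l.count('|') == 4 and not l.startswith('CIK|')]
print(records)
```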

Update on March 3, 2017: the SEC closed the FTP server permanently on December 30, 2016 and switched to a more secure transmission protocol, HTTPS. Since then I have received several requests to update the script. Here is the new script for Part I.


# Generate the list of index files archived in EDGAR since start_year (earliest: 1993) until the most recent quarter
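Only this first comment line of the updated script survives in this copy. As a sketch under the assumption that the full-index archives are now served from https://www.sec.gov/Archives/ (the URL layout is inferred from the current sec.gov site, not quoted from the lost script), the list-building step becomes:

```python
import datetime

# Rebuild the (year, quarter) history as in the FTP version, then map each
# entry to its HTTPS URL
start_year = 1993
today = datetime.date.today()
history = [(y, q) for y in range(start_year, today.year)
           for q in ['QTR1', 'QTR2', 'QTR3', 'QTR4']]
for i in range(1, (today.month - 1) // 3 + 1):
    history.append((today.year, 'QTR%d' % i))

urls = ['https://www.sec.gov/Archives/edgar/full-index/%d/%s/master.zip' % (y, q)
        for (y, q) in history]
print(urls[0])
# https://www.sec.gov/Archives/edgar/full-index/1993/QTR1/master.zip
```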

I found two articles that explain how to use R and Perl to achieve the same functionality. I include the links (R or Perl) for users who are more comfortable with R or Perl. Also, the technical details may be too boring for most people, so I provide the Stata dataset for download (size: 3.6GB), which includes all index files from 1993 Q1 to March 3, 2017.

Hi Kai, I have found your page while checking for ways to download SEC EDGAR files. I am very new to this topic and was wondering if you could help me with some questions I have regarding the use of Python for such a massive download? I would be happy to send you an email, but please let me know if you prefer me to ask you as a comment related to your post instead. Thanks!

Thanks for the data, but I am having a hard time extracting this file (around 395.95MB). The unzipped file size seems to be about 1.7GB, rather than 19GB. And I also get an error when the extraction process is close to 100%. Any thoughts?

Thank you! I just saw the reply. (I thought I would receive an email from you when I got a reply :p)
I found another website that also includes the EDGAR master file (http://www.wrds.us/index.php/repository/view/25).
When I compare that one with yours, the two are basically quite similar, with almost the same number of observations. One thing I notice is that yours has an ID, and the length of your variables is 255. Therefore their size is 1.9 GB and yours is 19 GB. Good job anyway!

This is great, but when I run the code I get “sqlite3.ProgrammingError: You must not use 8-bit bytestrings unless you use a text_factory that can interpret 8-bit bytestrings (like text_factory = str). It is highly recommended that you instead just switch your application to Unicode strings.”

Excellent work. We are a boutique investment bank based in Chicago and Dallas, specializing in marketing private offerings publicly via the JOBS Act/SEC guidelines. http://www.cassonmediagroup.com

One of the tools we are working on provides free, user-friendly access to all Private Offerings filed (Form D), plus all offerings that have been posted on the “Equity Crowdfunding” sites (that have not yet filed): a manual aggregation project.

Our thinking at this point is to not only extract via FTP, etc. all the new Form D filings every day, plus attached docs, but also to build a “web widget” app that can be embedded on any website (similar to stock quote widgets), giving users a fast and painless way to access/search private offerings.

Also, we have a mobile app (iPad, iPhone, iOS, Android) that would access the data.
At no point do we plan to monetize access; everything would be open source and free. We would ask for a name and email to use the app.

We would deeply appreciate any assistance that you could provide. A paid consulting gig, a 30-minute phone conference, a design review, anything!

Hi Kai Chen,
Great post. I was wondering where I can find Part II of this post, as I want only 10-K and 10-Q forms. Secondly, I’m able to create the ‘idx’ table, but somehow the values are not getting inserted. Could it be because I always end up with an error?

Hi Kai, thanks for the reply. I have a MySQL DB and I think the format of “records” isn’t the one MySQL expects. So I downloaded your .dta file, converted it into CSV format, and will import the CSV files into the table. But there’s one issue I’m having right now: the url or path we have in the table is in .txt format, but I want it in .htm format. I’m working on it.

I read your post on Stack Overflow. In my code I retrieve “master.idx” because the goal is to download and process raw text filings for further textual analysis. To achieve your goal, you can try to retrieve another type of index file: “crawler.idx”. See the example ftp://ftp.sec.gov/edgar/daily-index/crawler.20160201.idx. That path leads to the HTML version. You still need some tweaking (go one level deeper), but that should be doable.
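Another common trick for reaching the HTML version (my own addition, not from the original reply): each .txt path in the master index maps to the filing’s index page by swapping the extension. This mapping is an assumption based on EDGAR’s Archives layout, and the accession number below is just a sample value:

```python
def index_page_url(txt_path):
    """Map a raw .txt filing path to its filing index page (assumed
    EDGAR Archives layout; sample path for illustration)."""
    assert txt_path.endswith('.txt')
    return ('https://www.sec.gov/Archives/' +
            txt_path[:-len('.txt')] + '-index.htm')

print(index_page_url('edgar/data/320193/0001193125-15-356351.txt'))
# https://www.sec.gov/Archives/edgar/data/320193/0001193125-15-356351-index.htm
```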

Thanks so much for your post and sample code. It helps me a lot. I was thinking about doing the same thing but do not have enough skill to write a Python script like this. I appreciate your effort.
I ran the script on Python 3.5.2 and was able to create an SQLite db file. But I get errors when I try to export it as a .dta file. So I just downloaded the .dta file as well, but I want to know what went wrong. Below is the error message.

I don’t know the exact cause based on the error log, but I guess it may be related to the relative vs. absolute path. Try this: first, check the file size of “edgar_idx.db”. If it is a sizeable file, the problem lies in the conversion from SQLite to Stata; second, use the absolute path in the conversion step: engine = create_engine(r'sqlite:///C:\path\to\edgar_idx.db'). Replace 'C:\path\to\' with your own absolute path.

I was so excited when the Python script was running, but unfortunately, at the end, it gave an OperationalError. Could it be due to the file size, the first possibility you mentioned above? Could you explain how to deal with big db files?

I didn’t try my code on a Windows machine. The first step is to make sure you have generated the SQLite database correctly. If the database has gathered the data it is supposed to have, its size should be about 2GB. This first step helps you decide at which stage the error occurs. If you can get the correct database file but just cannot export the data to another program, you can Google other exporting methods (there are tons of them).

Thanks for your reply, Kai. I have encountered errors a few times before when running scripts written on a Mac. It is more than a backslash vs. forward slash issue, but I could not pinpoint the cause. I will figure it out and get back to you when I solve the issue. Thanks!

Thank you, Kai, for sharing this tutorial. After tweaking your code to better suit my environment, it worked on the first try! I do, however, see something odd about the data returned. Although my SQLite db is about 2GB as you said it should be (it’s actually 1.7GB with another table in it), I find that I was only able to index ~16.4M filings when I ran the script today.

However, this article written back in 2014 (http://tinyurl.com/jupr3zy), indicates that there should’ve been at least 25M filings. I’ve seen various other articles that note there should be at least 20M.

Thanks for your code! I am using it to successfully download proxy statements in txt format. Since the htm files are better formatted, I am trying to revise your code to download crawler.idx. But I am stuck at this line:
cur.executemany('INSERT INTO idx VALUES (?, ?, ?, ?, ?)', records)
as I don’t know how to set line.split for crawler.idx. Can you help revise the code? Thanks in advance!
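For what it’s worth, since crawler.idx uses fixed-width columns and company names can contain spaces, splitting from the right is one workable approach. The sample line below is my own illustration, not taken from a real index file:

```python
# A crawler.idx body line holds: Company Name, Form Type, CIK, Date Filed,
# URL. Splitting from the right keeps multi-word company names intact.
line = ('APPLE INC          10-K        320193    2015-10-28  '
        'https://www.sec.gov/Archives/edgar/data/320193/0001193125-15-356351-index.htm')

head, cik, date, url = line.rsplit(None, 3)
# 'head' still holds company name + form type in fixed-width columns; with
# the real file, slice it using the column offsets shown in the header row
print(cik, date)
print(url)
```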

That is really useful code that can be reused in many scenarios. I am trying to create a database containing detailed information from 13F-HR filings. I did it by getting all the raw txt/xml data based on your code 🙂

Something that frustrates me a little is that there are only raw txt files for 13F until 2013 Q3, when XML became available.

Although it is quite easy to parse XML with Python’s ElementTree, I am still struggling with how to deal with the txt data before 2013 Q3.

My current knowledge of text data processing is also limited. To start with, you can learn regular expressions. That could resolve many questions. The higher-level applications need knowledge of natural language processing using Python (or other programming languages). It’s technically intimidating.

Hi Ken, thanks for your post. I am just wondering whether you know if it is possible to automate an online download with Stata. It is the only program I am comfortable using, but I have never written such code before.

Hi Kai, it is very useful code. But I encountered the errors below when executing your exact code on both of my computers. Do you know the reason, and how I can solve it? I need to analyze the 10-Ks to finish my homework. I appreciate your thoughts. Thank you very much!
---------------------------------------------------------------------------
MemoryError                               Traceback (most recent call last)
in ()
     38 with engine.connect() as conn, conn.begin():
     39     data = pandas.read_sql_table('idx', conn)
---> 40 data.to_stata('edgar_idx.dta')

Hi Jing, I am not sure where the problem lies. Maybe you didn’t install the required modules correctly, or maybe your machines didn’t have enough free RAM. Try changing start_year to a very recent year, say 2017, and see if the code goes through. If yes, then RAM is probably the culprit.
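If RAM is indeed the constraint, one workaround is to stream the table in chunks and append to a CSV rather than materializing the whole DataFrame (to_stata needs the full frame in memory). A sketch, where the tiny in-memory table stands in for the real edgar_idx.db:

```python
import io
import sqlite3

import pandas

# Tiny stand-in for the full edgar_idx.db
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE idx (cik TEXT, type TEXT, path TEXT)')
conn.executemany('INSERT INTO idx VALUES (?, ?, ?)',
                 [(str(i), '10-K', 'edgar/data/%d/x.txt' % i) for i in range(10)])

# Stream the table chunk by chunk; only one chunk is ever held in memory
out = io.StringIO()  # in practice: open('edgar_idx.csv', 'w')
for i, chunk in enumerate(pandas.read_sql_query('SELECT * FROM idx', conn,
                                                chunksize=4)):
    chunk.to_csv(out, index=False, header=(i == 0))

print(out.getvalue().splitlines()[0])  # header row
```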

Thanks so much for your tutorial and sample code. I really appreciate it. Your code works perfectly fine, except that Python (v3.5, using PyCharm) hangs when trying to download the 2011 QTR4 and 2017 QTR3 master.idx. I managed to download all the other master.idx files by changing the year/quarter variables. I have tried on another computer and the same thing happened. Any idea what the problem might be? I can work with the Stata file you uploaded for now, but I was thinking of extending the data in the future.