Edgar Submission Format
Edgar URLs are duplicated. Edgar file URLs include both the CIK and the accession number, but only the accession number uniquely identifies the file. Therefore every Form 4 (ownership report) shows up under two URLs, one with the owner's CIK (e.g. a director or CEO) and another with the company's CIK.
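A minimal sketch of collapsing the duplicate URLs by keying on the accession number. The function names are made up for illustration, the two CIKs in the sample are placeholders, and the sketch assumes the dashed accession number (10-2-6 digits) appears somewhere in each URL.

```python
import re
from collections import defaultdict

# Dashed accession numbers look like 0000950123-05-004029 (10-2-6 digits).
ACCESSION_RE = re.compile(r"\d{10}-\d{2}-\d{6}")


def accession_of(url):
    """Return the dashed accession number found in url, or None."""
    match = ACCESSION_RE.search(url)
    return match.group(0) if match else None


def group_by_accession(urls):
    """Map each accession number to the list of URLs that point at it."""
    groups = defaultdict(list)
    for url in urls:
        key = accession_of(url)
        if key is not None:
            groups[key].append(url)
    return dict(groups)


# Hypothetical sample: one filing listed under two placeholder CIKs.
urls = [
    "edgar/data/1111111/0000950123-05-004029.txt",
    "edgar/data/2222222/0000950123-05-004029.txt",
]
unique = group_by_accession(urls)
```

Downloading one URL per key in `unique` fetches each submission once, whichever CIK it was listed under.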
Subfiles are redundant. In an Edgar subdirectory, the file that ends in .txt (which will be the largest file) contains each of the other files, wrapped in <DOCUMENT> tags. Downloading the subfiles separately is therefore redundant.
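Here is a rough sketch of pulling the subfiles back out of a submission file. Only the <DOCUMENT> wrapper is described above; the <TYPE>, <SEQUENCE>, <FILENAME>, <DESCRIPTION> and <TEXT> lines are an assumption about the usual layout inside each wrapper, and the sample text is invented.

```python
import re

DOCUMENT_RE = re.compile(r"<DOCUMENT>(.*?)</DOCUMENT>", re.DOTALL)
# Assumed header lines that precede <TEXT> inside each document block.
HEADER_RE = re.compile(
    r"^<(TYPE|SEQUENCE|FILENAME|DESCRIPTION)>(.*)$", re.MULTILINE
)


def split_documents(submission_text):
    """Return one dict per <DOCUMENT> block: header fields plus 'text'."""
    docs = []
    for match in DOCUMENT_RE.finditer(submission_text):
        head, sep, rest = match.group(1).partition("<TEXT>")
        fields = {
            name.lower(): value.strip()
            for name, value in HEADER_RE.findall(head)
        }
        # Everything between <TEXT> and the last </TEXT> is the subfile body.
        body = rest.rsplit("</TEXT>", 1)[0] if sep else ""
        fields["text"] = body.strip("\n")
        docs.append(fields)
    return docs


# Invented sample with two embedded subfiles.
sample = """<DOCUMENT>
<TYPE>10-K
<SEQUENCE>1
<FILENAME>report.htm
<TEXT>
Annual report body
</TEXT>
</DOCUMENT>
<DOCUMENT>
<TYPE>GRAPHIC
<SEQUENCE>2
<FILENAME>logo.gif
<TEXT>
begin 644 logo.gif
end
</TEXT>
</DOCUMENT>
"""
docs = split_documents(sample)
```

Each entry in `docs` carries the subfile's name and body, so nothing beyond the .txt file has to be downloaded.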
Graphics (JPEGs, GIFs, etc.) are text encoded. In their native state these files are already efficiently compressed (some more than others), so text encoding only makes them larger. There is not much you can do about this since, as stated above, each graphic is included in the "submission file" that ends with .txt.
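If you need the original image back, something like the following may work. It assumes the text encoding is classic uuencoding (a "begin <mode> <name>" line, data lines, then "end"), which is not stated above, so check it against a real submission before relying on it. The function name is made up.

```python
import binascii


def uudecode(text):
    """Decode the first uuencoded payload in text; return (filename, bytes).

    Assumes the 'begin <mode> <name>' ... 'end' framing of uuencode.
    """
    lines = iter(text.splitlines())
    filename = None
    for line in lines:
        parts = line.split(" ", 2)
        if len(parts) == 3 and parts[0] == "begin":
            filename = parts[2].strip()
            break
    if filename is None:
        raise ValueError("no 'begin' line found")

    chunks = []
    for line in lines:
        stripped = line.strip()
        if stripped == "end":
            break
        if stripped in ("", "`"):
            # Zero-length data line that some encoders emit before 'end'.
            continue
        chunks.append(binascii.a2b_uu(line.encode("ascii")))
    return filename, b"".join(chunks)
```

The decoded bytes can be written straight to a file named after the returned filename.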
Virtual File System
These file systems are all read-only.
Squashfs: Good all-around vfs. It has random read access. Inode blocks are compressed individually. Very good tradeoff of functionality. This file system can also be added to after initial creation. The winner on speed.
Cramfs: Comparable to Squashfs but with file system size limitations that are a factor. File system size is limited to 256 MB, so it is used mostly for embedded systems.
Cromfs: Slower than Squashfs but with higher compression ratios. The winner on size.
Kernel Space or User Space
Kernel space means that the vfs must be compiled into the operating system kernel. So immediately we are talking about a bigger commitment. Think and research before you go down this route. Kernel space does have some clear advantages, speed being among them.
Linux has the FUSE system which stands for Filesystem in USErspace. By being in user space:
- There is no need to recompile the operating system kernel
- We can tweak the system to access the files inside an Edgar submission.
- We can allow for automatic downloading of submissions not yet retrieved from Edgar.
FUSE has bindings for Python as well as for your favorite language. These are needed only to write the vfs; all your apps can access the files as usual.
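The automatic-download idea in the list above can be sketched without any FUSE code at all: a user-space vfs would call something like this before serving a file. The function name, the cache layout and the `fetch` callback are all hypothetical; a real `fetch` would download the submission from Edgar and return its bytes.

```python
import os
import tempfile


def ensure_local(cache_dir, relative_path, fetch):
    """Return the local path of a submission, downloading it on a cache miss.

    fetch is a caller-supplied function: fetch(relative_path) -> bytes.
    """
    local_path = os.path.join(cache_dir, relative_path)
    if not os.path.exists(local_path):
        target_dir = os.path.dirname(local_path)
        os.makedirs(target_dir, exist_ok=True)
        data = fetch(relative_path)
        # Write to a temporary file and rename it into place, so a
        # half-finished download is never visible under the final name.
        fd, tmp_path = tempfile.mkstemp(dir=target_dir)
        with os.fdopen(fd, "wb") as handle:
            handle.write(data)
        os.replace(tmp_path, local_path)
    return local_path
```

Because the callback is only invoked when the file is missing, repeated reads of the same submission hit Edgar once.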
Openvest Archives
You can currently access squashfs compressed file images of SEC 10-K and 10-Q submissions from Amazon S3 archives. Contact me for details.
Install squashfs. Something like:
sudo yum install squashfs-tools
Download an annual archive.
cd /data/edgar
mkdir -p archive
curl -o archive/05.sqsh http://edgar.openvest.s3.amazonaws.com/archive/05.sqsh
Create an empty mount point.
mkdir -p archive/05
Mount the file.
sudo mount -t squashfs -o loop archive/05.sqsh archive/05
You can now access files in the archive as you would any other files on the file system:
cat /data/edgar/archive/05/ef/63/000095012305004029/0000950123-05-004029.txt
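The file path is the year (annual archive), then two subdirectories named after the first two and the next two characters of the md5 hexdigest of the accession number, then a directory named after the accession number without dashes, which holds the submission file.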
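A small sketch of building that path for a given accession number. Whether the digest is taken over the dashed or the undashed accession number is not spelled out above; the sketch assumes the dashed form, so check its output against a path you know exists. The function name is made up.

```python
import hashlib
import posixpath


def archive_path(mount_point, accession):
    """Build the path of a submission file inside a mounted annual archive.

    mount_point is the mounted year directory, e.g. /data/edgar/archive/05.
    Assumption: the md5 digest is computed over the dashed accession number.
    """
    digest = hashlib.md5(accession.encode("ascii")).hexdigest()
    folder = accession.replace("-", "")  # accession number without dashes
    return posixpath.join(
        mount_point, digest[:2], digest[2:4], folder, accession + ".txt"
    )


path = archive_path("/data/edgar/archive/05", "0000950123-05-004029")
```

If the two hash directories come out wrong for a known file, try hashing the undashed form instead.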
