New Data Format
In August 2018, we began publishing datasets in GZIP instead of LZ4. Each new dataset is made up of a large number of small GZIP-compressed files rather than a single, large LZ4-compressed file.
For those with the proper access, historical scans published before this change will still be provided in the LZ4 format. You can view the format of the scan by looking here:
Change is Good
At the end of the day, we made this change with our customers in mind. Censys scan data can be upwards of 1 TB in size. With a single file, this meant that if the download failed at any point in the process, it failed completely. Because the scans are now offered as multiple files, you can ingest some of the files in one session and then pick up where you left off if an error occurs.
The other main reason behind our decision is that some users may not have enough space on their machine to store 1 TB of data. With multiple smaller files, users can now split the storage across multiple machines. We'll touch more on this process in a bit.
Downloading the Files
When downloading the old LZ4 files, many users simply clicked the link provided and downloaded the file within their web browser.
With the new GZIP files, we provide a command that can be used to download the files. An example is given below. Note that you'll need your individual API ID and API Secret, which you can find at: https://censys.io/account/api
CENSYS_API_USERNAME="YOUR API ID" \
CENSYS_API_SECRET="YOUR API SECRET" ; \
curl -L "https://censys.io/api/v1/data/ipv4_2018/20180917T0409" -s \
-H "Content-Type: application/json; charset=utf-8" \
-u "$CENSYS_API_USERNAME:$CENSYS_API_SECRET" \
| jq -r '.files[].download_path' | sort | xargs -n1 -P2 wget
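Because each scan is now split into many files, an interrupted download can be picked up again without starting over. The following is a minimal sketch of one way to do that, not an official Censys tool. It assumes the download links have been saved one per line to a file (for example, by redirecting the sorted jq output to a file instead of piping it into xargs). The file name urls.txt, the directory name scan_data, the ".done" marker files, and the function names are all hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: resume an interrupted multi-file download.
# Assumption: the URL list holds one download link per line and the last
# path component of each link is usable as a local file name.

# Print every URL from the list that has no ".done" marker in dest_dir yet.
pending_urls() {
  local url_list="$1" dest_dir="$2" url name
  while IFS= read -r url; do
    [ -n "$url" ] || continue            # skip blank lines
    name="${url##*/}"                    # last path component of the URL
    [ -e "$dest_dir/$name.done" ] || printf '%s\n' "$url"
  done < "$url_list"
}

# Download whatever is still pending. "wget -c" continues a partial file,
# and the marker is written only after wget reports success.
fetch_pending() {
  local url_list="$1" dest_dir="$2" url name
  mkdir -p "$dest_dir"
  pending_urls "$url_list" "$dest_dir" | while IFS= read -r url; do
    name="${url##*/}"
    wget -c -P "$dest_dir" "$url" && touch "$dest_dir/$name.done"
  done
}
```

Running fetch_pending urls.txt scan_data again after a failure skips every file that already has a marker and continues with the rest. Keeping the markers separate from the data files means a downloaded file can be moved elsewhere without being fetched a second time.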
If the machine you are working from doesn't have enough disk space for the entire scan, you may want to add a delimiter to your download command so that it fetches only part of the file list at a time. This lets you follow a download, transfer, delete, repeat process. We’ll keep a 5-day rolling window of active download links, which should provide enough time to ingest the entire set of files.
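The download, transfer, delete, repeat process can be sketched as a small batch loop. This is an illustrative example under stated assumptions rather than a prescribed workflow: it assumes the download links are saved one per line in a file, and it uses rsync for the transfer step purely as an example. The function names, the batch size, and the destination are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: process a long URL list in fixed-size batches so that only
# one batch of files is on the local disk at any time.

# Print batch number n (1-based) of batch_size lines from the URL list.
nth_batch() {
  local url_list="$1" batch_size="$2" n="$3"
  local first=$(( (n - 1) * batch_size + 1 ))
  local last=$(( n * batch_size ))
  sed -n "${first},${last}p" "$url_list"
}

# Download one batch, copy it to dest, delete the local copy, then repeat.
# dest is any rsync destination (a local path or host:path); it is an
# assumption of this sketch, not part of the Censys command.
run_batches() {
  local url_list="$1" batch_size="$2" work_dir="$3" dest="$4"
  local n=1 batch
  while batch="$(nth_batch "$url_list" "$batch_size" "$n")" && [ -n "$batch" ]; do
    mkdir -p "$work_dir"
    # Download: two parallel wget processes, as in the original command.
    printf '%s\n' "$batch" | xargs -n1 -P2 wget -q -P "$work_dir" || return 1
    # Transfer: stop without deleting anything if the copy fails.
    rsync -a "$work_dir"/ "$dest"/ || return 1
    # Delete: free the local space before the next batch.
    rm -rf "$work_dir"
    n=$((n + 1))
  done
}
```

For example, run_batches urls.txt 50 ./batch backup-host:/data/censys would keep at most 50 files on the local disk at once. The batch size should be chosen so that the whole list can be worked through while the download links are still active.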
If you have any other questions, please reach out to our team at firstname.lastname@example.org.