BigQuery Introduction
Censys Enterprise customers can perform complex queries against Censys data via a collection of BigQuery tables. Google BigQuery is Google's public data warehouse service that allows developers to interact with arbitrarily large datasets in real time using SQL and Javascript.
BigQuery offers a web UI, command-line tool, and REST API for querying, as well as Dataproc (Google's Spark and Hadoop offering). There are also a variety of third-party tools—like Looker and Tableau—that you can use to interact with Censys BigQuery data.
Censys Datasets
Current Datasets:
-
Universal Internet Dataset: A single table (
censys-io.universal_internet_dataset.current
) that contains structured/banner data from over 3500 ports on IPv4, IPv6 hosts and virtual hosts. -
Certificates: Two certificates tables (
certificates_v2.certificates_all
) that contain all unique X.509 certificates found during scan and by synchronizing with publicly known CT servers.
Historical Datasets:
-
Domains (deprecated) : A daily snapshot (
domain_public.current
) of the current state of the Alexa Top Million Domains collection. -
IPv4 (deprecated): A daily snapshot (
ipv4_public.current
) of the current state of known IPv4 hosts and how public protocols are configured. -
IPv4 Banners (deprecated): A daily snapshot (
ipv4_banners_public.current
) of the (banner) results of our continuous light-weight scanning on 2029 of the most popular ports.
Writing Queries
BigQuery's SQL dialect is compliant with the SQL 2011 standard, and you can query tables once you have added the Censys datasets.
Examples
Discover the distribution of services with Coinhive miner indicators of compromise across ports in today's Internet snapshot:
SELECT
port,
count(*)
FROM
`censys-io.universal_internet_dataset.current`,
UNNEST(services)
WHERE
service_name = 'HTTP'
AND REGEXP_CONTAINS(SAFE_CONVERT_BYTES_TO_STRING(http.response.body), r'(?i)coinhive.min.js')
GROUP BY 1
ORDER BY 2 DESC
Calculate the breakdown of cryptographic keys for certificates:
SELECT COUNT(*), parsed.subject_key_info.key_algorithm.name
FROM `censys-io.certificates_public.certificates`
GROUP BY parsed.subject_key_info.key_algorithm.name
Find websites using a specific cipher suite:
SELECT domain, alexa_rank
FROM `censys-io.domain_public.current`
WHERE p443.https.tls.cipher_suite.name = 'TLS_RSA_WITH_AES_256_CBC_SHA'
Standard SQL Only
You need to use Standard SQL to query the Censys datasets, which is now the default for Google BigQuery. If you have issues running queries, confirm that your query is using Standard SQL by opening the query editor and selecting the More dropdown button.
In the drop-down menu, select "Query Settings" to open a window similar to the one below. Confirm that any references to Legacy SQL are unselected.
Another option is to include the following command at the top of your SQL statement in the web interface as shown below:
#standardsql
<SQL GOES HERE>
For more information on constructing SQL statements, check out the BigQuery SQL Reference.
Helpful Hint
Censys datasets are extremely wide, often containing thousands of columns, so seemingly simple queries (e.g., finding a single record about a host: SELECT * FROM censys-io.universal_internet_dataset.current WHERE host_identifer.ipv4 = '8.8.8.8'
will process several terabytes of data. This can be costly ($5-10 for the single query). Reduce these costs by only selecting the columns you need or by combining several small queries into a single larger query.
Query Limitations
The data Censys directly exposes to users is accessible through BigQuery Views, which are subject to a few limitations:
- Standard SQL Only. BigQuery does not allow mixing standard and legacy SQL. This means that Censys views can only be queried using Standard SQL.
- No Direct Export Jobs. The results of queries against Censys views cannot be directly exported to a file on Google Cloud Storage. If you want to export results, you'll need to save the results of your query to a temporary table and export that to an external file.
- No Wildcard Queries. You cannot reference a view in a wildcard table query. Instead, you'll need to explicitly reference the tables you want to query.
If these limitations are causing significant problems, please reach out to our team at support@censys.io.
Connecting to BigQuery in a Script
There are two methods of authenticating against Google BigQuery in a non-interactive environment:
- Your User Account. You can use OAUTH if you're running a script personally.
- Service Account. The service accounts you create in your Google Cloud Project will not automatically be granted access to the Censys BigQuery datasets: access is only granted to your user account and this does not propagate to the cloud project. If you'd like to have a service account added, please reach out to us at support@censys.io.
Comments
0 comments
Article is closed for comments.