Universal Internet Dataset Release & Timeline FAQ
Answers to commonly asked questions about the Censys Universal Internet Dataset.
What is the Universal Internet Dataset?
What is changing?
The Universal Internet Dataset will replace the existing ipv4 and ipv4_banner datasets. The new single dataset combines the scan data that previously lived in two separate datasets. The schema and some of the data encodings are also improved over the previous datasets.
When will the change happen?
The Universal Internet Dataset is available as of January 2021, and we plan to discontinue publishing legacy datasets on November 30, 2021.
How will the change be rolled out?
Beginning in January 2021, the new and old datasets will both be available for a limited time. After existing enterprise customers have switched to the new dataset, we will discontinue publishing the old datasets for consumption via download and BigQuery. We will continue to provide historical downloadable files for the ipv4 and ipv4_banner datasets, but new files will not be published after the switchover.
How has the schema changed?
At a high level, the new schema presents services primarily by their name, because the Censys scanner can now detect their protocol on any port we scan, not just the standard port number.
In the deprecated IPv4 dataset, for example, port 53 was equivalent to dns. The Censys scanner only attempted to communicate with a DNS service when it saw port 53 open, and it never looked for DNS on other port numbers. The JSON for a DNS service was structured like this:
{
  "53": {
    "dns": { … }
  }
}
The Universal Internet Dataset reflects the fact that the Censys scanner can find DNS running on any open port. The new JSON response is structured like this:
{
  "ip": "1.1.1.1",
  "services": [
    {
      "dns": {
        "serverType": "FORWARDING"
      },
      "port": 53,
      "extendedServiceName": "DNS",
      "observedAt": "2020-10-07T19:03:14.865850374Z",
      "perspectiveId": "PERSPECTIVE_TELIA",
      "serviceName": "DNS",
      "sourceIp": "192.35.168.240",
      "transportProtocol": "UDP"
    }
  ]
}
The values of the perspective field are static and indicate which Tier-1 ISP Censys peered with to observe the service.
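Because services are now listed by name rather than keyed by port, finding a protocol means filtering the services array instead of looking up a fixed port. The following is a minimal Python sketch of that lookup, using the sample record shown above. The helper name services_named is hypothetical and not part of any Censys library.

```python
import json

# Sample host record, copied from the JSON structure shown above.
HOST = json.loads("""
{
  "ip": "1.1.1.1",
  "services": [
    {
      "dns": {"serverType": "FORWARDING"},
      "port": 53,
      "extendedServiceName": "DNS",
      "observedAt": "2020-10-07T19:03:14.865850374Z",
      "perspectiveId": "PERSPECTIVE_TELIA",
      "serviceName": "DNS",
      "sourceIp": "192.35.168.240",
      "transportProtocol": "UDP"
    }
  ]
}
""")


def services_named(host, name):
    """Return (port, transport) pairs for every service with the given name.

    Hypothetical helper for illustration; it matches on "serviceName" only,
    so a DNS service on a non-standard port is found the same way as on 53.
    """
    return [
        (svc["port"], svc.get("transportProtocol"))
        for svc in host.get("services", [])
        if svc.get("serviceName") == name
    ]


print(services_named(HOST, "DNS"))  # [(53, 'UDP')]
```

With the legacy layout, the equivalent check would have been a lookup of the "53" key, which could not find DNS on any other port.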
Software and operating system (OS) identification by our next-gen scan engine follows the Common Platform Enumeration (CPE) specification, so the same software may appear differently in the new dataset and in the legacy datasets.
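To compare software identifiers across datasets, it can help to break a CPE name into its attributes. The sketch below assumes the identifier uses the CPE 2.3 formatted-string binding ("cpe:2.3:" followed by eleven colon-separated attributes); if the dataset uses the older URI binding ("cpe:/..."), the parsing would differ. The function name parse_cpe23 and the example value are hypothetical.

```python
import re

# Attribute order of a CPE 2.3 formatted string, after the "cpe:2.3:" prefix.
CPE_ATTRIBUTES = (
    "part", "vendor", "product", "version", "update", "edition",
    "language", "sw_edition", "target_sw", "target_hw", "other",
)


def parse_cpe23(cpe):
    """Split a CPE 2.3 formatted string into a dict of its 11 attributes.

    Simplification: a colon is treated as escaped whenever a backslash
    precedes it, and values are returned without unescaping.
    """
    # Split on colons that are not preceded by a backslash.
    fields = re.split(r"(?<!\\):", cpe)
    if len(fields) != 13 or fields[0] != "cpe" or fields[1] != "2.3":
        raise ValueError(f"not a CPE 2.3 formatted string: {cpe!r}")
    return dict(zip(CPE_ATTRIBUTES, fields[2:]))


# Hypothetical example value, not taken from the dataset.
parsed = parse_cpe23("cpe:2.3:a:example_vendor:example_product:1.2.3:*:*:*:*:*:*:*")
print(parsed["part"], parsed["vendor"], parsed["product"], parsed["version"])
```

The part attribute distinguishes applications (a), operating systems (o), and hardware (h), which matches the part field documented in the schema at the end of this article.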
What’s missing from the new dataset that was in the old?
At this time, Censys tags are not present in the new dataset. This feature is on our roadmap and tags will be added later in 2021.
Metadata related to some headline vulnerabilities has also been removed. The fields heartbleed, rsa_export, dhe_export, and dhe are not included in the new dataset.
What additional enhancements are coming to the new dataset?
Enhanced IoT device identification, additional protocol identification (SIP/SIP-TLS, SCCP/SCCPS, TLS 1.3, HTTP/2, QUIC, IKEv1 and IKEv2, X11, PPTP, Cisco AnyConnect), CoAP, and IPv6 scan data will be added to the new dataset throughout 2021.
How can I access the new dataset?
The Universal Internet Dataset is available in BigQuery. The view name is universal_internet_dataset.
The dataset can also be downloaded via the API. To provide quick access and to serve as an example for your own production ETL pipeline, the example below is a bash script that downloads the files using jq and curl.
# Snapshot date to download and API credentials (fill in your own values).
SERIES_DATE="<YYYYMMDD>"
CENSYS_API_USERNAME="<censys-api-username>"
CENSYS_API_SECRET="<censys-api-secret>"
ROOT="https://file-host-0.censys.io/snapshots/observations/$SERIES_DATE"

# Fetch the manifest, extract each filename, and download every listed file.
# xargs -I '{}' runs one curl per filename; -P2 runs two downloads in parallel.
curl -L "$ROOT/manifest.json" -s \
  -H "Content-Type: application/json; charset=utf-8" \
  -u "$CENSYS_API_USERNAME:$CENSYS_API_SECRET" \
  | jq -r '.files[].filename' \
  | sort \
  | xargs -P2 -I '{}' curl -O \
      -u "$CENSYS_API_USERNAME:$CENSYS_API_SECRET" "$ROOT/{}"
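For pipelines written in Python, the manifest step of the script above can be sketched as follows. This is a minimal illustration that only builds the list of download URLs. It assumes the manifest has the files[].filename shape that the jq filter relies on; the function name download_urls and the sample manifest are hypothetical.

```python
import json

# Same root URL pattern as the ROOT variable in the bash script.
ROOT_TEMPLATE = "https://file-host-0.censys.io/snapshots/observations/{series_date}"


def download_urls(manifest, series_date):
    """Return the download URL for every file in a manifest, sorted by filename."""
    root = ROOT_TEMPLATE.format(series_date=series_date)
    filenames = sorted(entry["filename"] for entry in manifest["files"])
    return [f"{root}/{name}" for name in filenames]


# Hypothetical manifest: only the "files[].filename" shape comes from the script.
manifest = json.loads('{"files": [{"filename": "b.avro"}, {"filename": "a.avro"}]}')
for url in download_urls(manifest, "20210101"):
    print(url)
```

Each URL would then be fetched with the same API username and secret that the script passes to curl.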
The resulting files are in .avro format, so you'll need an Avro decoder. You can use a CLI decoder or decode the files directly in Python.
Will the Censys Search web UI be changing?
Yes! Censys Search 2.0, which is powered by the Universal Internet Dataset, has launched and is available to everyone at https://search.censys.io.
What about the API?
Other than the addition of the new dataset for download, the data API will remain unchanged. In the future, as the search platform adopts the dataset, the API will be updated to reflect those changes. We will also release a new API to access historical data.
Additional Information
Schema Format
{
"name": "CensysAvro",
"type": "record",
"fields": [
{
"name": "host_identifier",
"doc": "Structured data about a host",
"type": [
"null",
{
"name": "HostIdentifier",
"type": "record",
"fields": [
{
"name": "ipv4",
"type": ["null", "string"],
"default": null
},
{
"name": "ipv6",
"type": ["null", "string"],
"default": null
}
],
"namespace": "io.censys.avro.enterprise_data.darkly"
}
],
"default": null
},
{
"name": "services",
"doc": "Services running on this host",
"type": {
"type": "array",
"items": {
"name": "Service",
"type": "record",
"fields": [
{
"name": "port",
"type": "long",
"default": 0
},
{
"name": "transport",
"doc": "Transport protocol used to contact this service.",
"type": ["null", "string"],
"default": null
},
{
"name": "service_name",
"doc": "Name of the service running.",
"type": ["null", "string"],
"default": null
},
{
"name": "perspective",
"doc": "Perspective from which Censys saw this service.",
"type": ["null", "string"],
"default": null
},
{
"name": "observed_at",
"doc": "A UTC timestamp denoting when this service was last observed.",
"type": ["null", "string"],
"default": null
},
{
"name": "tls",
"doc": "Information about tls handshake done with this host.",
"type": [
"null",
{
"name": "Tls",
"type": "record",
"fields": [
{
"name": "version_selected",
"type": "string",
"default": ""
},
{
"name": "cipher_selected",
"type": "string",
"default": ""
},
{
"name": "certificates",
"type": [
"null",
{
"name": "Certificates",
"type": "record",
"fields": [
{
"name": "leaf_fp_sha_256",
"type": "bytes",
"default": ""
},
{
"name": "chain_fps_sha_256",
"type": {
"type": "array",
"items": "bytes"
},
"default": []
},
{
"name": "leaf_data",
"type": [
"null",
{
"name": "LeafCertificateData",
"type": "record",
"fields": [
{
"name": "names",
"type": {
"type": "array",
"items": "string"
},
"default": []
},
{
"name": "subject_dn",
"type": "string",
"default": ""
},
{
"name": "issuer_dn",
"type": "string",
"default": ""
},
{
"name": "pubkey_bit_size",
"type": "int",
"default": 0
},
{
"name": "pubkey_algorithm",
"type": "string",
"default": ""
},
{
"name": "tbs_fingerprint",
"type": "string",
"default": ""
}
],
"namespace": "io.censys.avro.host.services"
}
],
"default": null
}
],
"namespace": "io.censys.avro.host.services"
}
],
"default": null
}
],
"namespace": "io.censys.avro.host.services"
}
],
"default": null
},
{
"name": "banner",
"doc": "A byte array representing raw response from a host if applicable.",
"type": ["null", "bytes"],
"default": null
},
{
"name": "software",
"doc": "List of software products detected on this service, include CPE Uniform Resource Identifiers.",
"type": {
"type": "array",
"items": {
"name": "Software",
"type": "record",
"fields": [
{
"name": "uniform_resource_identifier",
"doc": "CPE uri format as defined here: https://nvlpubs.nist.gov/nistpubs/Legacy/IR/nistir7695.pdf",
"type": ["null", "string"],
"default": null
},
{
"name": "part",
"doc": "Defines the class of this software, a for application, o for operating system, h for hardware devices.",
"type": ["null", "string"],
"default": null
},
{
"name": "vendor",
"doc": "Identifies the person or organization that manufactured or created the product.",
"type": ["null", "string"],
"default": null
},
{
"name": "product",
"doc": "Identifies the most common and recognizable title or name of the product.",
"type": ["null", "string"],
"default": null
},
{
"name": "version",
"doc": "Vendor-Specific alphanumeric strings characterizing the particular release version of the product.",
"type": ["null", "string"],
"default": null
},
{
"name": "update",
"doc": "Vendor-Specific alphanumeric strings characterizing the particular update, service pack, or point release of the product.",
"type": ["null", "string"],
"default": null
},
{
"name": "sw_edition",
"doc": "Characterizes how the product is tailored to a particular market or class of end users.",
"type": ["null", "string"],
"default": null
},
{
"name": "target_sw",
"doc": "Characterizes the software computing environment within which the product operates.",
"type": ["null", "string"],
"default": null
},
{
"name": "target_hw",
"doc": "Characterizes the instruction set architecture (e.g., x86) on which the product being described. Bytecode-intermediate languages, such as Java bytecode for the Java Virtual Machine or Microsoft Common Intermediate Language for the Common Language Runtime virtual machine, are be considered instruction set architectures.",
"type": ["null", "string"],
"default": null
},
{
"name": "language",
"doc": "Valid language tag as defined by [RFC5646], and should be used to define the language supported in the user interface of the product being described.",
"type": ["null", "string"],
"default": null
},
{
"name": "component_uniform_resource_identifiers",
"doc": "URIs of software components related to the identified software.",
"type": {
"type": "array",
"items": "string"
},
"default": []
},
{
"name": "other",
"doc": "Other attributes describing the identified software",
"type": {
"type": "array",
"items": {
"name": "OtherEntry",
"type": "record",
"fields": [
{
"name": "key",
"type": "string",
"default": ""
},
{
"name": "value",
"type": "string",
"default": ""
}
],
"namespace": "io.censys.avro.host.services.software"
}
},
"default": []
},
{
"name": "edition",
"doc": "Captures edition-related terms applied by the vendor to the product, deprecated in CPE 2.3, but kept for backwards compatibility with CPE 2.2.",
"type": ["null", "string"],
"default": null
}
],
"namespace": "io.censys.avro.host.services"
}
},
"default": []
},
{
"name": "source_ip",
"doc": "The Source IP address that censys saw the service from.",
"type": "string",
"default": ""
},
{
"name": "service",
"type": [
"null",
"BannerGrab",
"Http",
"Vnc",
"Rdp",
"Ssh",
"Mysql",
"Ipmi",
"Amqp",
"Elasticsearch",
"Kubernetes",
"Memcached",
"Mssql",
"Oracle",
"Prometheus",
"Redis",
"Snmp",
"Postgres",
"Mongodb",
"Bacnet",
"Dnp3",
"Dns",
"Ftp",
"Imap",
"Ipp",
"Modbus",
"Mqtt",
"Ntp",
"PcAnywhere",
"Pop3",
"S7",
"Smb",
"Smtp",
"Telnet",
"Fox",
"Openvpn",
"Coap"
]
}
],
"namespace": "io.censys.avro.enterprise_data.darkly"
}
},
"default": []
}
],
"namespace": "io.censys"
}
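As a rough guide to reading records that follow this schema, the Python sketch below flattens one decoded record into a per-service summary. It assumes an Avro decoder that returns each record as a plain dict, with "bytes" fields as Python bytes and null unions as None. The function name summarize_record and the sample record are hypothetical, and the values are made up.

```python
def summarize_record(record):
    """Flatten one decoded record into a list of per-service summary dicts."""
    # host_identifier and its members are nullable in the schema.
    host = record.get("host_identifier") or {}
    address = host.get("ipv4") or host.get("ipv6")
    rows = []
    for svc in record.get("services", []):
        # tls and certificates are nullable unions, so guard against None.
        tls = svc.get("tls") or {}
        certs = tls.get("certificates") or {}
        leaf = certs.get("leaf_fp_sha_256") or b""
        rows.append({
            "address": address,
            "port": svc.get("port", 0),
            "transport": svc.get("transport"),
            "service_name": svc.get("service_name"),
            # Fingerprints are Avro "bytes"; hex is easier to read and compare.
            "leaf_fp_sha_256": leaf.hex() or None,
            # Keep only software entries that carry a CPE identifier.
            "software": [
                sw["uniform_resource_identifier"]
                for sw in svc.get("software", [])
                if sw.get("uniform_resource_identifier")
            ],
        })
    return rows


# Hypothetical record shaped like the schema; the values are made up.
record = {
    "host_identifier": {"ipv4": "192.0.2.10", "ipv6": None},
    "services": [
        {
            "port": 443,
            "transport": "TCP",
            "service_name": "HTTP",
            "tls": {
                "version_selected": "TLSv1_2",
                "cipher_selected": "TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256",
                "certificates": {"leaf_fp_sha_256": bytes.fromhex("ab" * 32)},
            },
            "software": [],
        }
    ],
}
for row in summarize_record(record):
    print(row["address"], row["port"], row["service_name"], row["leaf_fp_sha_256"])
```

Note that the field names here are the snake_case names from the Avro schema (service_name, observed_at), not the camelCase names in the JSON response example earlier in this article.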