About the data gathered
In our latest scan of September 2017:
- About 4 billion IP addresses were scanned for port 6667 (the entire IPv4 address space)
- 10,239 servers were found.
- Our data gatherer was able to fully connect to 8,995 of these IRC servers. The other servers rejected our connection due to password or IP restrictions.
- Most statistics require a full connection; statistics such as CAP, however, could be gathered on all servers.
- The numbers above are after deduplication: because IRC servers may listen on several IP addresses, thousands of duplicates had to be filtered out.
- Bouncers like psyBNC and BitlBee are also filtered out.
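CAP statistics can likely be gathered without a full connection because IRCv3 capability negotiation (CAP LS) takes place before registration completes. The following is a minimal sketch of parsing one such reply line, not the code used for the scan. The function name parse_cap_ls is hypothetical, and the sketch assumes the capability list is sent as a trailing parameter introduced by " :".

```python
def parse_cap_ls(line):
    """Parse one 'CAP <nick> LS' reply into (capabilities, more_follows).

    capabilities maps each token to its value, or None if it has no value.
    more_follows is True when the server marks the reply as continued ('*').
    """
    head, _, trailing = line.rstrip("\r\n").partition(" :")
    params = head.split()
    if params and params[0].startswith(":"):
        params = params[1:]  # drop the server prefix
    if len(params) < 3 or params[0] != "CAP" or params[2] != "LS":
        raise ValueError("not a CAP LS reply: %r" % line)
    more_follows = len(params) > 3 and params[3] == "*"
    capabilities = {}
    for token in trailing.split():
        name, _, value = token.partition("=")
        capabilities[name] = value or None
    return capabilities, more_follows
```

Counting how often each token appears across all parsed replies would then give per-capability totals and percentages.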
About data deduplication and network correlation
Data deduplication and network correlation are both necessary to get reliable statistics. Unfortunately, they are also time-consuming tasks.
Server deduplication
We need to deduplicate data: IRC servers may listen on multiple IP addresses, some even in totally different netblocks (yes, really). It is very important that these duplicates are filtered out. Example: 10 servers even listened on 500+ IP addresses in total!
On the other hand, server names and network names are not unique, so they cannot be used as decisive factors. Example: there are 900+ servers with the network names 'ROXnet' and 'debian'. These are default network names in configuration files, hence the high number of matches. They are obviously not two big networks.
Fortunately, we have come up with a reliable way to deduplicate server data using a combination of attributes such as server name, network name, software version, uptime information and more. If all of these are the same, it is extremely likely to be the same server. We filtered out thousands of duplicates using this method.
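The idea of matching on several attributes at once can be sketched as grouping scan results by a composite key. This is only an illustration under stated assumptions, not the actual deduplication code: the Observation record and its field names are hypothetical, and the uptime is assumed to have been converted to a boot timestamp so that it compares equal across scan results for the same server.

```python
from collections import defaultdict
from typing import NamedTuple


class Observation(NamedTuple):
    """One scan result for a single IP address (hypothetical record layout)."""
    ip: str
    server_name: str
    network_name: str
    version: str
    boot_time: int  # assumed: Unix timestamp derived from the reported uptime


def deduplicate(observations):
    """Group IP addresses that appear to belong to the same physical server."""
    groups = defaultdict(list)
    for obs in observations:
        # No single field is unique; only the combination is treated as a match.
        key = (obs.server_name.lower(), obs.network_name, obs.version, obs.boot_time)
        groups[key].append(obs.ip)
    return dict(groups)
```

Each key in the result stands for one server, and the list under it holds every IP address on which that server was seen. Two servers that share a default network name such as 'ROXnet' still end up in separate groups unless all other fields match as well.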
Network correlation
It is not always easy to detect which servers belong to which network. Some statistics, such as user and channel counts, can only be published after this is done; otherwise they would not be reliable. There's no room for error: if you fail to detect servers belonging to the same network, you will very quickly count users and channels twice or more. This will cause counts to be off by tens of thousands, which is not acceptable.
The 2016 scan contains insufficient data to do proper network correlation. Some servers on the same network turned out to have different network names, while some distinct networks shared the same network name. This would not have been much of a problem if a significant number of servers had not also blocked /MAP and /LINKS.
In the 2017 scan we gathered additional data which should hopefully help us in these cases.
Networks running services were easy to correlate, hence the user counts on the Services page.
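Where /LINKS or /MAP output is available, one common way to group servers into networks is a union-find (disjoint-set) pass over the reported server links. The sketch below illustrates that general technique and is not necessarily how correlation is done for these statistics. The function name correlate and the (server, uplink) input format are assumptions.

```python
def correlate(links):
    """Group server names into networks from (server, uplink) name pairs."""
    parent = {}

    def find(name):
        # Return the representative of the set containing name.
        parent.setdefault(name, name)
        while parent[name] != name:
            parent[name] = parent[parent[name]]  # path halving
            name = parent[name]
        return name

    for server, uplink in links:
        root_a, root_b = find(server), find(uplink)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two networks

    networks = {}
    for name in list(parent):
        networks.setdefault(find(name), set()).add(name)
    return list(networks.values())
```

Once servers are grouped this way, global user and channel counts can be taken once per group instead of once per server, which avoids the double counting described above.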
I want to see more data / more graphs!
Follow us on Twitter if you want to stay informed. Send us a tweet if you have a suggestion or request.
Note that reliable statistics with user and channel counts can only be published after network correlation is done. This is hard, so we'll see when that happens.
Important: we will not give away data that may identify individual networks, servers or users.
Can I get a copy of the data set?
These data sets are currently available. If you use them, please credit ircstats.org.
- Servers data (JSON): server software in use and, for each version, the number of servers deployed.
- CAP data (JSON): CAP capabilities offered by servers. The "parent array" contains numbers and percentages of the tokens in use on all servers. The arrays under each parent contain numbers and percentages by server software in use (this only includes servers that allowed us to fully connect).
- TLS protocol data (JSON): SSL/TLS protocol offered on port 6697 (if any). The "parent array" contains numbers and percentages of the SSL/TLS protocols available on all servers. The arrays under each parent contain numbers and percentages by server software in use (this only includes servers that allowed us to fully connect).
- TLS certificate data (JSON): Validity of SSL certificates offered on port 6697. The "parent array" contains numbers and percentages of the validity statistics of all servers that offer SSL/TLS on 6697. The arrays under each parent contain numbers and percentages by server software in use for servers with SSL/TLS on 6697 (this only includes servers that allowed us to fully connect).
- Services data (JSON): services package installed and, for each version, the number of networks and the number of users on these networks.