Tuesday, August 5, 2014

Java large-volume DNS query problem

http://stackoverflow.com/questions/11955409/non-blocking-async-dns-resolving-in-java

Is there a clean way to resolve a DNS query (get an IP by hostname) in Java asynchronously, in a non-blocking way (i.e. a state machine, not 1 query = 1 thread)? I'd like to run tens of thousands of queries simultaneously, but not run tens of thousands of threads.
What I've found so far:
  • The standard InetAddress.getByName() implementation is blocking, and the standard Java libraries appear to lack any non-blocking implementation.
  • The "Resolving DNS in bulk" question discusses a similar problem, but the only solution found there is a multi-threaded approach (i.e. each thread works on only 1 query at any given moment), which does not really scale.
  • The dnsjava library is also blocking only.
  • There are ancient non-blocking extensions to dnsjava dating from 2006; they lack modern Java concurrency features such as Futures and, alas, offer only a very limited, queue-only implementation.
  • The dnsjnio project is also an extension to dnsjava, but it too uses a threaded model (i.e. 1 query = 1 thread).
  • asyncorg seems to be the best available solution I've found so far for this issue, but:
    • it dates from 2007 and looks abandoned
    • it has almost no documentation/javadoc
    • it uses lots of non-standard techniques, such as a Fun class
Any other ideas/implementations I've missed?
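For reference, the multi-threaded baseline that the points above describe might look roughly like the sketch below. It assumes Java 8's CompletableFuture and a fixed-size pool; the class and method names (PooledResolver, resolveAll, lookup) are made up for illustration. Each in-flight lookup still occupies one pool thread inside InetAddress.getByName(), which is exactly the scaling limit in question.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical baseline: blocking lookups spread over a bounded thread pool.
class PooledResolver {
    /** Resolves every host; concurrency is capped at `threads` lookups at a time. */
    static Map<String, String> resolveAll(List<String> hosts, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            Map<String, CompletableFuture<String>> futures = new LinkedHashMap<>();
            for (String host : hosts) {
                // Each task blocks one pool thread until the resolver answers.
                futures.put(host, CompletableFuture.supplyAsync(() -> lookup(host), pool));
            }
            Map<String, String> result = new LinkedHashMap<>();
            futures.forEach((host, future) -> result.put(host, future.join()));
            return result;
        } finally {
            pool.shutdown();
        }
    }

    /** Blocking lookup; returns null when the host cannot be resolved. */
    static String lookup(String host) {
        try {
            return InetAddress.getByName(host).getHostAddress();
        } catch (UnknownHostException e) {
            return null;
        }
    }
}
```

Throughput here is bounded by the pool size divided by the average lookup latency, so reaching tens of thousands of concurrent queries would need a comparable number of threads.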
Clarification. I have a fairly large (several TB per day) amount of logs. Every log line has a host name that can be from pretty much anywhere on the internet, and I need an IP address for that host name for my further statistics calculations. The order of lines doesn't really matter, so, basically, my idea is to start 2 threads. The first iterates over the lines:
  • Read a line, parse it, get the host name
  • Send a query to the DNS server to resolve the host name; don't block waiting for the answer
  • Store the line and DNS query socket handle in some buffer in memory
  • Go to the next line
And a second thread that will:
  • Wait for the DNS server to answer any query (using an epoll/kqueue-like technique)
  • Read the answer and find in the buffer which line it was for
  • Write the line with the resolved IP to the output
  • Proceed to waiting for the next answer
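The send-then-wait scheme above can be sketched in plain java.nio, with the caveat that this is a hand-rolled illustration and not a real DNS library. It makes several assumptions. Both roles run in one loop on a single non-blocking UDP channel, where the description has two threads. Answers are matched to lines by the 16-bit DNS transaction ID, where the description uses a per-query socket handle. Each input entry is taken to be the already-parsed host name. The names DnsSketch, encodeQuery, firstARecord and resolveAll are hypothetical, and there are no retries, no TCP fallback and no validation of malformed packets.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.DatagramChannel;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of ID-matched, non-blocking DNS lookups over one UDP socket.
class DnsSketch {
    /** Builds a recursive A-record query in DNS wire format (RFC 1035). */
    static byte[] encodeQuery(int id, String host) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(id >> 8); out.write(id);          // transaction ID
        out.write(0x01); out.write(0x00);           // flags: recursion desired
        out.write(0); out.write(1);                 // QDCOUNT = 1
        for (int i = 0; i < 6; i++) out.write(0);   // ANCOUNT, NSCOUNT, ARCOUNT = 0
        for (String label : host.split("\\.")) {
            byte[] b = label.getBytes(StandardCharsets.US_ASCII);
            out.write(b.length);                    // length-prefixed label
            out.write(b, 0, b.length);
        }
        out.write(0);                               // root label ends the name
        out.write(0); out.write(1);                 // QTYPE = A
        out.write(0); out.write(1);                 // QCLASS = IN
        return out.toByteArray();
    }

    static int u16(byte[] p, int pos) {
        return ((p[pos] & 0xFF) << 8) | (p[pos + 1] & 0xFF);
    }

    static int responseId(byte[] p) { return u16(p, 0); }

    /** Returns the offset just past the (possibly compressed) name at pos. */
    static int skipName(byte[] p, int pos) {
        while (true) {
            int n = p[pos] & 0xFF;
            if (n == 0) return pos + 1;
            if ((n & 0xC0) == 0xC0) return pos + 2; // compression pointer
            pos += n + 1;
        }
    }

    /** Returns the first IPv4 address in the answer section, or null if none. */
    static String firstARecord(byte[] p) {
        int pos = 12;
        for (int i = u16(p, 4); i > 0; i--) pos = skipName(p, pos) + 4; // skip questions
        for (int i = u16(p, 6); i > 0; i--) {
            pos = skipName(p, pos);
            int type = u16(p, pos), rdLen = u16(p, pos + 8);
            pos += 10;                              // type, class, TTL, rdlength
            if (type == 1 && rdLen == 4) {
                return (p[pos] & 0xFF) + "." + (p[pos + 1] & 0xFF) + "."
                        + (p[pos + 2] & 0xFF) + "." + (p[pos + 3] & 0xFF);
            }
            pos += rdLen;
        }
        return null;
    }

    /** Sends all queries without waiting, then prints answers as they arrive. */
    static void resolveAll(List<String> hosts, InetSocketAddress server) throws IOException {
        Map<Integer, String> pending = new HashMap<>(); // transaction ID -> waiting entry
        try (DatagramChannel ch = DatagramChannel.open(); Selector sel = Selector.open()) {
            ch.configureBlocking(false);
            ch.connect(server);
            ch.register(sel, SelectionKey.OP_READ);
            // Sender role: one query per entry; at most 65536 IDs can be in flight.
            // A full send buffer makes write() return 0; this sketch does not retry.
            for (int id = 0; id < hosts.size() && id < 65536; id++) {
                pending.put(id, hosts.get(id));
                ch.write(ByteBuffer.wrap(encodeQuery(id, hosts.get(id))));
            }
            // Receiver role: wake up on readiness, drain every datagram available.
            ByteBuffer buf = ByteBuffer.allocate(512);
            while (!pending.isEmpty() && sel.select(2000) > 0) {
                sel.selectedKeys().clear();
                for (buf.clear(); ch.read(buf) > 0; buf.clear()) {
                    String host = pending.remove(responseId(buf.array()));
                    if (host != null) System.out.println(host + " " + firstARecord(buf.array()));
                }
            }
        }
    }
}
```

A call might look like `DnsSketch.resolveAll(Arrays.asList("example.com"), new InetSocketAddress("8.8.8.8", 53))`, where the server address is only an example. For real log volumes the sender would have to be throttled and IDs recycled as answers come in, and queries that never get an answer would need a timeout and a retry.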
A simple model implementation in Perl using AnyEvent shows me that my idea is generally correct and that I can easily achieve speeds like 15-20K queries per second this way. (A naive blocking implementation gets about 2-3 queries per second, just for the sake of comparison, so that's roughly 4 orders of magnitude difference.) Now I need to implement the same in Java, and I'd like to skip rolling my own DNS implementation ;)
