What's Next: August 2014

Thursday, August 7, 2014

Emulating a Browser in Python with mechanize

http://stockrt.github.io/p/emulating-a-browser-in-python-with-mechanize/

Posted by Rogério Carvalho Schneider

16 Aug 2009

It is always useful to know how to quickly instantiate a browser from the command line or inside your Python scripts.
Whenever I need to automate a task against a web system, I use this recipe to emulate a browser in Python:

import mechanize
import cookielib

# Browser
br = mechanize.Browser()

# Cookie Jar
cj = cookielib.LWPCookieJar()
br.set_cookiejar(cj)

# Browser options
br.set_handle_equiv(True)
br.set_handle_gzip(True)
br.set_handle_redirect(True)
br.set_handle_referer(True)
br.set_handle_robots(False)

# Follow refresh 0 but do not hang on refresh > 0
br.set_handle_refresh(mechanize._http.HTTPRefreshProcessor(), max_time=1)

# Want debugging messages?
#br.set_debug_http(True)
#br.set_debug_redirects(True)
#br.set_debug_responses(True)

# User-Agent (this is cheating, ok?)
br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]

Now you have this br object, which is your browser instance. With it you can open a page to inspect or interact with:

# Open some site, let's pick a random one, the first that pops in mind:
r = br.open('http://google.com')
html = r.read()

# Show the source
print html
# or
print br.response().read()

# Show the html title
print br.title()

# Show the response headers
print r.info()
# or
print br.response().info()

# Show the available forms
for f in br.forms():
    print f

# Select the first (index zero) form
br.select_form(nr=0)

# Let's search
br.form['q']='weekend codes'
br.submit()
print br.response().read()

# Looking at some results in link format
for l in br.links(url_regex='stockrt'):
    print l

If you are about to access a password protected site (http basic auth):

# If the protected site didn't receive the authentication data you would
# end up with a 401 error in your face
br.add_password('http://safe-site.domain', 'username', 'password')
br.open('http://safe-site.domain')
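
As a rough illustration of what basic auth puts on the wire, the credentials travel as a base64-encoded Authorization header. The sketch below uses only the Python 3 standard library; the helper name basic_auth_header is hypothetical and is not part of mechanize.

```python
import base64

def basic_auth_header(username: str, password: str) -> str:
    """Return the Authorization header value for HTTP basic auth."""
    # The scheme name is followed by base64("username:password").
    token = base64.b64encode(f"{username}:{password}".encode("utf-8"))
    return "Basic " + token.decode("ascii")

header = basic_auth_header("username", "password")
print(header)
```

Base64 is an encoding, not encryption, so basic auth over plain http exposes the password to anyone who can see the traffic.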

Thanks to the Cookie Jar we added earlier, you do not have to worry about session handling for authenticated sites, such as a service that requires a POST (form submit) of a user name and password. Such sites usually ask your browser to store a session cookie and expect it to send that same cookie back when it requests the page again. All of this, storing and re-sending the session cookies, is done by the Cookie Jar. Neat!
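
The store-and-resend cycle can be sketched without any network access. This is a minimal Python 3 illustration using http.cookiejar (the Python 3 name of cookielib); the FakeResponse class, the example.com URLs and the sessionid cookie are made up for the demo, and mechanize does the equivalent work internally.

```python
import email.message
import http.cookiejar
import urllib.request

class FakeResponse:
    """Minimal stand-in for an HTTP response; the jar only calls info()."""
    def __init__(self, headers):
        self._headers = headers

    def info(self):
        return self._headers

jar = http.cookiejar.CookieJar()

# 1. The "server" answers a login request with a session cookie.
login = urllib.request.Request("http://example.com/login")
headers = email.message.Message()
headers["Set-Cookie"] = "sessionid=abc123; Path=/"
jar.extract_cookies(FakeResponse(headers), login)

# 2. On the next request to the same site, the jar adds the cookie back.
followup = urllib.request.Request("http://example.com/account")
jar.add_cookie_header(followup)
cookie_header = followup.get_header("Cookie")
print(cookie_header)
```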
You can also work with the browsing history:

# Testing presence of link (if the link is not found you would have to
# handle a LinkNotFoundError exception)
br.find_link(text='Weekend codes')

# Actually clicking the link
req = br.click_link(text='Weekend codes')
br.open(req)
print br.response().read()
print br.geturl()

# Back
br.back()
print br.response().read()
print br.geturl()

Downloading a file:

# Download
f = br.retrieve('http://www.google.com.br/intl/pt-BR_br/images/logo.gif')[0]
print f
fh = open(f)

Setting a proxy for your http navigation:

# Proxy and user/password
br.set_proxies({"http": "joe:password@myproxy.example.com:3128"})

# Proxy
br.set_proxies({"http": "myproxy.example.com:3128"})
# Proxy password
br.add_proxy_password("joe", "password")

But if you just want to quickly open a web page, without the fancy features above, use this:

# Simple open?
import urllib2
print urllib2.urlopen('http://stockrt.github.com').read()

# With password?
import urllib
opener = urllib.FancyURLopener()
print opener.open('http://user:password@stockrt.github.com').read()
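
The snippets above target Python 2. On Python 3, urllib2 became urllib.request and cookielib became http.cookiejar, so a rough standard-library equivalent might look like the sketch below. The make_opener and fetch helpers and the User-agent string are hypothetical, and the network call is left commented out.

```python
import http.cookiejar
import urllib.request

USER_AGENT = "Mozilla/5.0 (X11; Linux x86_64) ExampleBot/0.1"  # hypothetical value

def make_opener():
    """Build an opener that keeps cookies and sends a custom User-agent."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    opener.addheaders = [("User-agent", USER_AGENT)]
    return opener, jar

def fetch(opener, url, timeout=10):
    """Return the decoded body of url; this performs a real network request."""
    with opener.open(url, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")

opener, jar = make_opener()
# html = fetch(opener, "http://stockrt.github.com")  # uncomment to hit the network
```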

See more in the Python mechanize site, mechanize docs and ClientForm docs.
Also, I have made this post to elucidate how to handle html forms and sessions with python mechanize and BeautifulSoup.

Tuesday, August 5, 2014

JAVA large volume DNS query problem.

http://stackoverflow.com/questions/11955409/non-blocking-async-dns-resolving-in-java

Is there a clean way to resolve a DNS query (get IP by hostname) in Java asynchronously, in non-blocking way (i.e. state machine, not 1 query = 1 thread - I'd like to run tens of thousands queries simultaneously, but not run tens of thousands of threads)?
What I've found so far:

The standard InetAddress.getByName() implementation is blocking, and it looks like the standard Java libraries lack any non-blocking implementation.
The "Resolving DNS in bulk" question discusses a similar problem, but the only solution found is a multi-threaded approach (i.e. one thread working on only 1 query at any given moment), which is not really scalable.
The dnsjava library is also blocking only.
There are ancient non-blocking extensions to dnsjava dating from 2006, so they lack modern Java concurrency features such as Futures and, alas, offer only a very limited queue-only implementation.
The dnsjnio project is also an extension to dnsjava, but it too uses a threaded model (i.e. 1 query = 1 thread).
asyncorg seems to be the best available solution I've found so far targeting this issue, but:
- it's also from 2007 and looks abandoned
- lacks almost any documentation/javadoc
- uses lots of non-standard techniques such as Fun class

Any other ideas/implementations I've missed?
Clarification. I have a fairly large (several TB per day) amount of logs. Every log line has a host name that can be from pretty much anywhere around the internet and I need an IP address for that hostname for my further statistics calculations. Order of lines doesn't really matter, so, basically, my idea is to start 2 threads: first to iterate over lines:

Read a line, parse it, get the host name
Send a query to DNS server to resolve a given host name, don't block for answer
Store the line and DNS query socket handle in some buffer in memory
Go to the next line

And a second thread that will:

Wait for DNS server to answer any query (using epoll / kqueue like technique)
Read the answer, find which line it was for in a buffer
Write line with resolved IP to the output
Proceed to waiting for the next answer

A simple model implementation in Perl using AnyEvent shows me that my idea is generally correct and that I can easily achieve speeds of 15-20K queries per second this way (a naive blocking implementation gets 2-3 queries per second, just for the sake of comparison, so that's about 4 orders of magnitude difference). Now I need to implement the same in Java, and I'd like to skip rolling out my own DNS implementation ;)
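
The bookkeeping behind the two-thread idea can be sketched in a few lines. This is an illustrative Python sketch, not Java and not a full resolver: it only builds A-record queries in the DNS wire format and matches an answer back to its log line by the 16-bit transaction ID. The function names, the sample log lines and the faked answer are assumptions; a real implementation would send the packets over a non-blocking UDP socket and wait for replies with a selector.

```python
import struct

def build_query(txid: int, hostname: str) -> bytes:
    """Encode a standard DNS query for the A record of hostname (RFC 1035)."""
    # Header: ID, flags (0x0100 = recursion desired), QDCOUNT=1, other counts 0.
    header = struct.pack("!HHHHHH", txid, 0x0100, 1, 0, 0, 0)
    labels = hostname.rstrip(".").split(".")
    qname = b"".join(bytes([len(l)]) + l.encode("ascii") for l in labels) + b"\x00"
    return header + qname + struct.pack("!HH", 1, 1)  # QTYPE=A, QCLASS=IN

def transaction_id(packet: bytes) -> int:
    """The first two bytes of any DNS message carry the transaction ID."""
    return struct.unpack("!H", packet[:2])[0]

# First thread: parse a line, build the query, remember the line.
pending = {}
outgoing = []
log_lines = ["a.example.com GET /index.html", "b.example.org GET /about"]
for txid, line in enumerate(log_lines, start=1):
    host = line.split()[0]
    outgoing.append(build_query(txid, host))  # would be sent with sock.sendto(...)
    pending[txid] = line

# Second thread: an answer arrives (faked here by echoing a query) and is matched.
fake_answer = outgoing[1]
matched_line = pending.pop(transaction_id(fake_answer))
print(matched_line)
```

Because the transaction ID is only 16 bits wide, a single socket can distinguish at most 65536 outstanding queries this way, so tens of thousands of simultaneous queries sit close to that ceiling.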

Overview of RAMFS and TMPFS on Linux--by Ramesh Natarajan on November 6, 2008

Using ramfs or tmpfs you can allocate part of the physical memory to be used as a partition. You can mount this partition and start writing and reading files like a hard disk partition. Since you’ll be reading and writing to the RAM, it will be faster.

When a vital process becomes drastically slow because of disk writes, you can choose either ramfs or tmpfs file systems for writing files to the RAM.

Both tmpfs and ramfs mounts give you fast reads and writes because the files live in primary memory. When you test this on a small file, you may not see a huge difference. You’ll notice the difference only when you write a large amount of data to a file alongside some other processing overhead such as network traffic.

1. How to mount Tmpfs

# mkdir -p /mnt/tmp

# mount -t tmpfs -o size=20m tmpfs /mnt/tmp

The last line in the following df -k shows the above mounted /mnt/tmp tmpfs file system.

# df -k
Filesystem      1K-blocks  Used     Available Use%  Mounted on
/dev/sda2       32705400   5002488  26041576  17%   /
/dev/sda1       194442     18567    165836    11%   /boot
tmpfs           517320     0        517320    0%    /dev/shm
tmpfs           20480      0        20480     0%    /mnt/tmp

2. How to mount Ramfs

# mkdir -p /mnt/ram

# mount -t ramfs -o size=20m ramfs /mnt/ram

The last line in the following mount command shows the above mounted /mnt/ram ramfs file system.

# mount
/dev/sda2 on / type ext3 (rw)
proc on /proc type proc (rw)
sysfs on /sys type sysfs (rw)
devpts on /dev/pts type devpts (rw,gid=5,mode=620)
/dev/sda1 on /boot type ext3 (rw)
tmpfs on /dev/shm type tmpfs (rw)
none on /proc/sys/fs/binfmt_misc type binfmt_misc (rw)
sunrpc on /var/lib/nfs/rpc_pipefs type rpc_pipefs (rw)
fusectl on /sys/fs/fuse/connections type fusectl (rw)
tmpfs on /mnt/tmp type tmpfs (rw,size=20m)
ramfs on /mnt/ram type ramfs (rw,size=20m)

You can mount ramfs and tmpfs during boot time by adding an entry to the /etc/fstab.

3. Ramfs vs Tmpfs

Primarily, both ramfs and tmpfs do the same thing, with a few minor differences.

Ramfs will grow dynamically. So, you need to control the process that writes the data to make sure ramfs doesn’t go above the available RAM size in the system. Let us say you have 2GB of RAM on your system and created a 1GB ramfs mounted as /mnt/ram. When the total size of /mnt/ram crosses 1GB, you can still write data to it; the system will not stop you from writing more than 1GB. However, when it goes above the total RAM size of 2GB, the system may hang, as there is no place left in RAM to keep the data.
Tmpfs will not grow dynamically. It will not allow you to write more than the size you specified when mounting it, so you don’t need to worry about controlling the process that writes the data. Writes beyond the limit fail with an error such as “No space left on device”.
Tmpfs uses swap.
Ramfs does not use swap.
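
A writer targeting a size-limited tmpfs mount can treat “No space left on device” (errno ENOSPC) as a normal condition. The following is a minimal Python sketch under that assumption; write_chunks and the FullDevice stand-in are hypothetical names, and FullDevice only simulates a full mount so the example runs without root privileges.

```python
import errno

def write_chunks(fileobj, chunk: bytes, count: int) -> int:
    """Write chunk up to count times; return the bytes written before ENOSPC."""
    written = 0
    for _ in range(count):
        try:
            written += fileobj.write(chunk)
        except OSError as exc:
            if exc.errno != errno.ENOSPC:
                raise  # some other I/O problem: let the caller see it
            break      # "No space left on device": the mount is full
    return written

class FullDevice:
    """Hypothetical stand-in for a file on a size-limited tmpfs mount."""
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.used = 0

    def write(self, data: bytes) -> int:
        if self.used + len(data) > self.capacity:
            raise OSError(errno.ENOSPC, "No space left on device")
        self.used += len(data)
        return len(data)

stored = write_chunks(FullDevice(capacity=2500), b"x" * 1000, count=5)
print(stored)
```

Against a real mount such as /mnt/tmp, opening the file with open(path, "wb", buffering=0) makes the error surface at the write call instead of at a later flush or close.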

4. Disadvantages of Ramfs and Tmpfs

Since both ramfs and tmpfs write to system RAM, their contents are lost when the system reboots or crashes. So, you should write a process that copies the data from ramfs/tmpfs to disk at periodic intervals. You can also write a process that saves the data from ramfs/tmpfs to disk while the system is shutting down, but this will not help you in the event of a system crash.
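
One possible shape for such a copy process is sketched below in Python (3.8 or later for dirs_exist_ok). The function name snapshot_to_disk is hypothetical, and temporary directories stand in for the RAM-backed mount and the on-disk target so the example can run anywhere.

```python
import shutil
import tempfile
from pathlib import Path

def snapshot_to_disk(ram_dir, disk_dir) -> int:
    """Copy everything under ram_dir to disk_dir; return the number of files on disk."""
    # dirs_exist_ok lets repeated runs overwrite the previous snapshot in place.
    shutil.copytree(ram_dir, disk_dir, dirs_exist_ok=True)
    return sum(1 for p in Path(disk_dir).rglob("*") if p.is_file())

# Demo: temporary directories stand in for /mnt/tmp and a path on disk.
with tempfile.TemporaryDirectory() as ram, tempfile.TemporaryDirectory() as disk:
    Path(ram, "stats.log").write_text("42 requests\n")
    copied = snapshot_to_disk(ram, Path(disk, "backup"))
    restored = Path(disk, "backup", "stats.log").read_text()
print(copied, restored)
```

Calling the function from cron, or from a loop with time.sleep between runs, gives the periodic behaviour. Note that files deleted from the RAM-backed directory are not removed from the snapshot by this sketch.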

Table: Comparison of ramfs and tmpfs
Experimentation	Tmpfs	Ramfs
Fill maximum space and continue writing	Will display error	Will continue writing
Fixed Size	Yes	No
Uses Swap	Yes	No
Volatile Storage	Yes	Yes

If you want your process to write faster, tmpfs is the better choice, provided you take precautions against a system crash.

This article was written by SathiyaMoorthy. He works at bksystems and is interested in writing articles and contributing to open source in his leisure time. The Geek Stuff welcomes your tips and guest articles.

Sunday, August 3, 2014

linux, process go to "S" status for a while. socket programming problem.

You could try to trace the system calls and signals of one of the processes concerned.
Maybe you'll find a hint on what's going on.

strace -p pid

where pid is the process id as found in the second column of "ps -ef".

You could add the "-f" flag to trace forked child processes as well:

strace -fp pid

Running strace -fp pid as suggested, I get a huge number of the following messages until I interrupt the command:

======================
strace -fp 30247
Process 30247 attached - interrupt to quit
SYS_7(0x3ffffaf9078, 0, 0xc350, 0, 0, 0x3ffffafc070, 0x800291fc, 0, 0x2000336dc40, 0x3ffffaf9088, 0x2000076e000, 0x20000741518, 0x200006e7840, 0x3ffffaf8fd8, 0x200007a7f30, 0, 0, 0, 0, 0, 0, 0, 0x3ffffaf9078, 0x8000000000000, 0x4050000000000000, 0, 0, 0, 0x4050000000000000, 0, 0, 0) = 0
poll([{fd=3, events=POLLIN|POLLERR|POLLHUP, revents=POLLIN}], 1, 60000) = 1
recv(3, "", 8192, MSG_DONTWAIT) = 0
nanosleep({0, 50000000}, NULL) = 0
poll([{fd=3, events=POLLIN|POLLERR|POLLHUP, revents=POLLIN}], 1, 60000) = 1
recv(3, "", 8192, MSG_DONTWAIT) = 0
nanosleep({0, 50000000}, NULL) = 0
poll([{fd=3, events=POLLIN|POLLERR|POLLHUP, revents=POLLIN}], 1, 60000) = 1
recv(3, "", 8192, MSG_DONTWAIT) = 0
nanosleep({0, 50000000}, NULL) = 0
poll([{fd=3, events=POLLIN|POLLERR|POLLHUP, revents=POLLIN}], 1, 60000) = 1
recv(3, "", 8192, MSG_DONTWAIT) = 0
nanosleep({0, 50000000}, NULL) = 0
poll([{fd=3, events=POLLIN|POLLERR|POLLHUP, revents=POLLIN}], 1, 60000) = 1
recv(3, "", 8192, MSG_DONTWAIT) = 0
nanosleep({0, 50000000}, NULL) = 0
poll([{fd=3, events=POLLIN|POLLERR|POLLHUP, revents=POLLIN}], 1, 60000) = 1
recv(3, "", 8192, MSG_DONTWAIT) = 0
nanosleep({0, 50000000}, NULL) = 0
poll([{fd=3, events=POLLIN|POLLERR|POLLHUP, revents=POLLIN}], 1, 60000) = 1
recv(3, "", 8192, MSG_DONTWAIT) = 0
nanosleep({0, 50000000}, NULL) = 0
poll([{fd=3, events=POLLIN|POLLERR|POLLHUP, revents=POLLIN}], 1, 60000) = 1
recv(3, "", 8192, MSG_DONTWAIT) = 0
nanosleep({0, 50000000}, NULL) = 0
poll([{fd=3, events=POLLIN|POLLERR|POLLHUP, revents=POLLIN}], 1, 60000) = 1
recv(3, "", 8192, MSG_DONTWAIT) = 0
nanosleep({0, 50000000}, NULL) = 0
poll([{fd=3, events=POLLIN|POLLERR|POLLHUP, revents=POLLIN}], 1, 60000) = 1
recv(3, "", 8192, MSG_DONTWAIT) = 0
nanosleep({0, 50000000}, NULL) = 0
poll([{fd=3, events=POLLIN|POLLERR|POLLHUP, revents=POLLIN}], 1, 60000) = 1
recv(3, "", 8192, MSG_DONTWAIT) = 0
nanosleep({0, 50000000}, NULL) = 0
poll([{fd=3, events=POLLIN|POLLERR|POLLHUP, revents=POLLIN}], 1, 60000) = 1
recv(3, "", 8192, MSG_DONTWAIT) = 0
nanosleep({0, 50000000}, NULL) = 0
poll([{fd=3, events=POLLIN|POLLERR|POLLHUP, revents=POLLIN}], 1, 60000) = 1
recv(3, "", 8192, MSG_DONTWAIT) = 0
nanosleep({0, 50000000}, NULL) = 0
poll([{fd=3, events=POLLIN|POLLERR|POLLHUP, revents=POLLIN}], 1, 60000) = 1
recv(3, "", 8192, MSG_DONTWAIT) = 0
nanosleep({0, 50000000}, NULL) = 0
poll([{fd=3, events=POLLIN|POLLERR|POLLHUP, revents=POLLIN}], 1, 60000) = 1
recv(3, "", 8192, MSG_DONTWAIT) = 0
nanosleep({0, 50000000}, NULL) = 0
poll([{fd=3, events=POLLIN|POLLERR|POLLHUP, revents=POLLIN}], 1, 60000) = 1
recv(3, "", 8192, MSG_DONTWAIT) = 0
nanosleep({0, 50000000}, NULL) = 0
poll([{fd=3, events=POLLIN|POLLERR|POLLHUP, revents=POLLIN}], 1, 60000) = 1
recv(3, "", 8192, MSG_DONTWAIT) = 0
nanosleep({0, 50000000}, NULL) = 0
poll([{fd=3, events=POLLIN|POLLERR|POLLHUP, revents=POLLIN}], 1, 60000) = 1
recv(3, "", 8192, MSG_DONTWAIT) = 0
nanosleep({0, 50000000}, <unfinished ...>
Process 30247 detached
================================================

Any ideas? :S

All one can see is that the program polls for an event on file descriptor 3 (which is evidently a socket) with a timeout of 60 seconds.
A POLLIN event is reported, which means "there is data to read", and the return value of 1 means that one descriptor structure came back with events set (the one indicating POLLIN).

The subsequent nonblocking recv() receives a message of length zero from that socket, so the process decides to go to sleep for a while to then reissue the poll() call.

So the reason why this process spends most of its time sleeping is that the socket keeps reporting itself as ready to read, but each read returns zero bytes. This is either a communications problem or desired behaviour. Note that on a stream (TCP) socket a zero-length recv() normally means the other end has closed the connection, in which case the process is polling a socket that will never deliver data again and just sleeps 50 ms between attempts.
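
The poll/recv pattern seen in the trace can be reproduced with a small experiment. This Python sketch is an illustration only: a local socket pair stands in for the real connection on descriptor 3, and the function name poll_closed_peer is hypothetical. It closes one end, then polls and reads the other end the same way the traced process does. It assumes a Unix-like system where select.poll and MSG_DONTWAIT are available.

```python
import select
import socket

def poll_closed_peer():
    """Close one end of a socket pair, then poll and read the other end."""
    ours, theirs = socket.socketpair()
    try:
        theirs.close()                       # the peer shuts down its end
        poller = select.poll()
        poller.register(ours, select.POLLIN | select.POLLERR | select.POLLHUP)
        events = poller.poll(1000)           # timeout in milliseconds
        readable = any(mask & select.POLLIN for _, mask in events)
        data = ours.recv(8192, socket.MSG_DONTWAIT)
        return readable, data
    finally:
        ours.close()

readable, data = poll_closed_peer()
print(readable, data)
```

If the socket is reported readable and recv() hands back an empty byte string, that matches the repeating poll/recv lines in the strace output and points to a peer that has gone away.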

You could additionally issue

lsof -p pid

to check where file descriptor (socket) 3 is connected to. The descriptor number is in the FD column, connection info is in the NAME column.

If the process whose data you posted has ended in the meantime and you're going to examine a different process please make sure to check for the correct socket descriptor - that's the "fd=..." number in the poll() call.

In any case (desired behaviour or communications problem) you should see your development folks to present this analysis to them and ask them what's the deal.