Mapping the Internet with Traceroute

So the other day, I was trying to get traceroute working with python, so that I could do a traceroute to different computers, and see what IPs were in common between the different routes. I quickly learned that ICMP (the protocol behind ping/traceroute) is nowhere near as supported by the python socket library as TCP or UDP is. In fact, other than a single IP flag, ICMP has no support!

So after a bit of struggling, I got my own traceroute/ping working in python using raw sockets. One thing I learned was that you have to be root to open a raw-socket server in linux. That made me wonder: "Well, how do the official traceroute/ping applications work without being root?" Well, it turns out that those apps (at least in Ubuntu) are setuid to root, so when you run them, they get run with the privileges of the root user. Live and learn, I guess.

Anyways, I got the traceroute working, and then I used the generated routes to create graphs using python-graph showing what servers are common among the different routes. I then plotted the routes using Graphviz. Here are some of the better graphs generated by my program:

First, this is a graph showing the paths from my computer (while logged into the campus wireless) to a few of the computers in the Computer Science Instructional Facility (CSIF), and then to Google, Yahoo, Digg, and PoweredByToast. (PoweredByToast is behind static.theplanet.com... we don't use any of those fancy dedicated servers... :) ) Also note that I blurred the names of some of the CSIF's internal servers. They aren't top-secret or anything, and I don't believe in security-through-obscurity, but it's still probably not a very good idea to publish them... :)

Image of my program's first big traceroute

In this graph, I took the top 20 sites on Alexa, and mapped them. (I removed all Microsoft-owned sites, since they seem to reject all ICMP packets anyway, so we wouldn't have learned anything, and we'd have to wait for the program to timeout on all of the connections.)

Traceroute of Alexa's non-MS top 20

Problems with the program:

  1. I don't check any of the sequence/id fields. In fact, I ignore them completely. This is a problem. For example, try doing a traceroute to a microsoft server. (Microsoft ignores ICMP messages, so the traceroute will eventually hang.) Now, use the normal ping utility to ping google. My program will see the ICMP message, and assume it's from Microsoft! Oops!
  2. The traceroute ignores failed links! For example, say we have servers A, B, and C. A is connected to B, and B to C. If we somehow fail to ping B, then the traceroute program will happily state that A is connected directly to C. Oops!
  3. If there are multiple paths to a destination, or one address resolves to multiple destinations, then the resulting route may include nodes from either path. This happened several times when I was mapping the route taken to google.
  4. This doesn't work behind a NAT router. I know it is possible to get it working eventually, but I'll figure that out at some other time...
  5. Try doing a traceroute on yourself. The program freaks out, because it doesn't understand why the remote server would be doing a PING REQUEST. :)

Anyways, the code can be found here:

Dependencies:

Source: http://www.poweredbytoast.com/files/tracert.py.txt

Note: You need to be root to use this program! So usage would be something like:

$ sudo python tracert.py google.com yahoo.com youtube.com ...

Post-Order Calculators

I've been working on a calculator program recently, and so I've had to review post-order notation (also called Reverse Polish notation) for formulas. This notation is where the two operands to an operator come before the operator itself. So "2 + 3" would be written in post-order like: "2 3 +". Order of operations, and parentheses can be removed easily as well, since something like: "2+3*(5-1)" would become: "5 1 - 3 * 2 +".

The great thing about post-order notation is that it has no ambiguity, since order-of-operations plays no role. All calculations are handled with simple stack operations, so the calculators for post-order notation are pretty easy to code, and the calculator is super-fast! This quickie Python implementation is less than 30 lines long, including white-space and comments:

def formula_eval(command):
if type(command) != type([]): raise TypeError("Bad Input")

# The stack contains stored numbers:
stack = []
for i in command:
# If this item is a number, just push it onto the stack:
if type(i) == type(1): stack.append(i)

# Else, it should be a one-character string, indicating the operation:
elif type(i) == type(""):
# Get the two operands:
a = stack.pop()
b = stack.pop()

# Figure out what we should do, and do it:
if i == '+': stack.append(a+b)
elif i == '-': stack.append(a-b)
elif i == '*': stack.append(a*b)
elif i == '/': stack.append(a/b)
elif i == '%': stack.append(a%b)
elif i == '^': stack.append(a**b)

else: raise Exception( "Unknown Command: <" + i + ">" )
else: raise TypeError("Bad Argument Type")

if len(stack) != 1: raise Exception("Malformed Equation")
return stack[0]

Then, we can feed this function something like:

cmd = [ 100, 40, '-', 20, 10, '*', '+', 2, '/']

... and we will get something is like:

>>> formula_eval(cmd)
130

So great. We have this evaluator that is real quick at evaluating an equation, but it only accepts formulas in some crazy notation that is completely unnatural to use. Wouldn't it be nice to be able to take something like "5 + 3 * (1 - 4)", and turn it into: "1 4 - 3 * 5 +"?? Well, luckily, there is a fast method to do this.

Before I actually describe the algorithm, let's go over what exactly it should do. Our converter needs to take order-of-operations into consideration, so that 1 + 2 * 4 gives us 9, and not 12. Secondly, our converter needs to be able to handle parentheses. Thirdly, it should handle function calls, like: sin 90, or log(100). Finally, it needs to handle both left-to-right, and right-to-left associativity. (So that x^y^z is handled like: x^(y^z), and not: (x^y)^z.

For the sake of simplicity, let's assume that we have some sort of lexical scanner has already converted an input formula string kind of like this:

>>> lexical_scanner("5+3*sin(1-4)^2^3")
[5, '+', 3, '*', 'sin', '(', 1, '-', 4, ')', '^', 2, '^', 3]

IE: It doesn't do any actual converting. It just turns the formula into an array, so that we don't have to worry about certain ambiguities that could arise during parsing. (We just care about the converting here.)

First of all, we'll have two stacks: an operator stack, and the output stack. Before we push a symbol onto the operator stack, we'll first pop() items off the operator stack and push them into the output stack while the top() of the operator stack has a greater than or equal precedence as the new symbol. We then push the new symbol. However, there are a few special cases.

First special case is the parentheses. When we encounter a '(' character, we just push it, and make sure it has a higher precedence than any of the other operands. Then, when we encounter a ')', we pop() every item in the operator stack into the output stack until we encounter a '(' operator. (Note: neither the '(' nor the ')' get entered into the output stack.)

The second special case is the function call. Functions, unlike the operators, have only 1 argument, and that argument is always the next item, whether it is a number or a parenthesized block. (ie: in "sin 90", "90" is the single operand. In "sin (90*4)", then "(90*4)" is that operand.) So, when we encounter a function name, we should just push it onto the stack. Then, whenever we handle either a number, or a ')', we pop all function names off of the operator stack, and into the output stack.

The final special case involves the '^' symbol. These are tricky little suckers, since they have a right-to-left precedence, unlike the other operators. We can handle them by making it so that pushing a '^' symbol won't pop() another '^' symbol. Simple enough. :)

A sample implementation follows:

def formula_convert(command):
if type(command) != type([]): raise TypeError("Bad Input")

# Flushes elements while some lambda function returns True:
def flush(cond):
while len(ops)>0 and cond(ops[-1]):
output.append(ops.pop())

# This thing tells the function what to pop() with each symbol:
to_pop = {}
to_pop['+'] = ['+', '-', '*', '/', '^']
to_pop['-'] = ['+', '-', '*', '/', '^']
to_pop['*'] = ['*', '/', '^']
to_pop['/'] = ['*', '/', '^']
to_pop['^'] = []

# These are our defined functions:
functions = ['sin', 'cos', 'tan', 'asin', 'acos', 'atan', 'log', 'exp', 'ln']

# Try to get rid of the ending special case:
command = ['('] + command + [')']

# The two temp stacks:
ops, output = [], []
for i in command:
# Numbers are directly added to the output:
if type(i) == type(1):
output.append(i)

# Flush out the functions from the top of the stack:
flush(lambda item: item in functions)

# Else, we handle the basic operators:
elif i in to_pop:
flush(lambda item: item in to_pop[i])
ops.append(i)

# Is this a function or an opening parenthesis?
elif i in functions or i=='(':
ops.append(i)

# Is this a closing parenthesis?
elif i == ')':
flush(lambda item: item != '(')

# Verify that the ')' had a matching '(':
if len(ops)>0: ops.pop()
else: raise Exception("Unmatched Parenthesis")

# Flush out the functions from the top of the stack:
flush(lambda item: item in functions)

# Uh-oh.
else: raise Exception("Unknown Symbol Encountered: " + str(i))

return output

So there we have it. Given a parser, we can convert a formula into post-order format, and then execute it without ambiguity. Pretty cool.

PS: After writing this thing, I looked it up, and Dijkstra came up with a similar algorithm, and it handles a few extra cases, such as the case where a function has several arguments, like: "log(10,100)", or something. It's called the Shunting-Yard Algorithm. It's nice to know that my solution wasn't totally out there, since Dijkstra came to a nearly identical solution. :)

Hamiltonian Paths

Ok, now for my first REAL entry. I've been looking the Hamiltonian Path problem, recently. For those of you who don't know, this is a really hard problem in Computer Science/Graph Theory. The problem is: "Given a graph, and two nodes, does there exist a Hamiltonian Path between those two nodes?" (A Hamiltonian Path is a path that travels to every single node, and never visits the same node twice.)

The naïve solution is pretty straight-forward:

import graph

def _naive_ham_helper(g, pts, dest):
# If we have used all points in the graph, but one, then the next
# hop had better be our destination.
if len(pts) == len(g.nodes())-1:
if dest in g.neighbors(pts[-1]):
return pts + [dest]
return None

# Else, we'll test every outgoing path, so long as we haven't tried
# it already, or it's the destination. (We aren't ready for it,
# yet...)
for i in g.neighbors(pts[-1]):
if i==dest or i in pts:
continue
p = _naive_ham_helper(g, pts+[i], dest)
if p:
return p

# Else, this branch didn't find anything. Tell the parent branch to
# keep looking.
return None

# The driver for the above function...
def naive_hamiltonian(g, start, dest):
return _naive_ham_helper(g, [start], dest)

This function pushes the first node, and then keeps recursively trying new paths, until we have reached the end, and used every node. For small graphs, this algorithm is ok, but when we start working with larger graphs (ie: in the hundreds to thousands of nodes...) this algorithm will be painfully slow.

We can do several things to try to get an answer quicker. First, we can make sure the graph is connected. If it is disconnected, then a Hamiltonian Path isn't possible. We can also scan the graph for nodes with degree 1 that aren't a starting/ending point. If one of those exist, then a Hamiltonian Path isn't possible. Also, if there is a connecting node (a node such that if it is removed, then the graph gets disconnected) and both the starting/ending points are on the same side of that point, then once again, the Hamiltonian Path isn't possible. While these steps may help to quickly find an impossible case, they don't actually speed up the search for a solution.

I've been tinkering with a purely heuristic algorithm, that is able to find a solution pretty reliably in many simple cases, but definitely not all. For example:

Failed Hamiltonian Path

Where the path is trying to go from node 1 to node 2, the Naïve (read: correct) solution is bolded, and my dumb-ass algorithm is in blue. (Notice that it's forgetting to go to node 7...)

More on this later. This post is long enough, and I need to work on homework... :X


About

User