Very Big Graphs

Thursday, January 13, 2005

Input Processing: The Joys of a Dodgy Hack

Just finished a rewrite of the input processing scripts. I now have a single Python script that reads a SOFTIE log from a file and does a few things with it. The first new feature is that it processes the option and parameter declarations output at the start of the log and warns the user if very_verbose, print_kept and print_deleted are not turned on. It understands enough to set and unset option flags and parameters, but the parsing is a little stupid. It will ignore warning lines, but treats everything else that looks like a parameter or option as a parameter or option, until it sees the start of the theorem proper.

It then proceeds to build a number of lists of clause Nodes. There is a list of all clauses and a list each of those that were kept, deleted and subsumed. Whilst building the lists of clauses, it keeps an eye on the time (i.e. the iteration) and increments it when necessary.

Once it is done building the lists of clauses, it, by default, prints out the list of kept clauses as a graph in the XML dialect used by prefuse, with the nodes and edges bearing the attributes that the visualisation software expects.

Though it would probably be faster if written with more care (or in a text processing language like Perl or awk), this new script has turned what used to be a 5 (or more) stage process with 3 passes over the log file, into a single stage process with one pass over the data. It doesn't spawn any processes and doesn't use much memory (as opposed to the old shell scripts, and the libxml2 Python scripts respectively).

Thursday, January 06, 2005

A Traditional Adjacency Matrix

John just raised the point that the size problem would be solved by using a 'full' adjacency matrix on the usable set (instead of all clauses). This would indeed give us a square of n size, with the triangular half of it filled. It might also be interesting in that it would allow us to, more easily, represent numerical statistics. These might include the number of clauses that were subsumed by a particular set of clauses, the number of children or the distance from the proof.

On the less positive side, it is possible that the plot might tend toward completeness. This may be a boon however as there may be some relation between its tendency toward completeness and the likelihood of finding a proof. Also, with a denser graph it may be reasonable (though probably quite naive) to assume that any blank space will be more meaningful.

These issues need, and deserve, to be addressed by implementing this idea. Sounds like I'm going to be trying to get a lot of little things done tomorrow.

The end is nigh. Or at least in sight.

The end is nigh, or in sight, depending on which way you look at it. There are only a few weeks left of the summer scholarship program, and a lot of work still to be done.

On the other hand, a meeting with my supervisor (John Slaney) today has reassured me that I'm not doing as badly as I had expected, and when you look at the graphs I've been drawing on a large scale, you can see why. The matrix style diagrams display a number of features that clearly illustrate properties of the system, and the problems being solved.

In the SET012 problem, we can clearly see repeating patterns which may be a graphical reflection of the symmetry of the problem (from Set Theory and regarding some relation between sets and another). Another interesting feature is a number of peculiar 'gaps' in the graph where large numbers of factors are generated consecutively. One of these gaps is several tens of clauses wide.

Also clearly visible are sloping vertical "lines" (in reality they are just adjacent points). This appears to be caused by the method used by the reasoning system. To generate new clauses from a given clause, the system attempts to unify it with the clauses in the usable set. This process traverses the list of usable clauses in reverse order which, for given clauses that unify with a large number of usable clauses, gives the appearance of sloping lines.

Other noticeable features include the appearance of a number of short horizontal "lines", strangely regular patterns, patterns that appear to be interrupted then continue and some patterns that bear a small resemblance to the Game of Life and other cellular automata.

I'm about to start working on adding some useful features to help give some more meaning to the graph. Unless I have a stroke of genius, this will probably be restricted to colouring the nodes (to easily differentiate factors, given clauses, etc) and highlighting those nodes that form part of the proof.

When these features are added in (hopefully be tomorrow or early next week), I'll move on to dynamic exploration. Some form of "zoom" function whereby those parts of the graph not in the immediate vicinity of the cursor are scaled down may be useful due to the rather long horizontal axis. Another feature that I'll have a go at implementing is a "cross-hair" on the cursor to allow the user to view other relations involving the node/s they are interested in, trace across to row and column labels, etc. Another possibility is to have the system display the text of a clause, the method that generated it and other possibly interesting data.

All in all, there is too much to do, and too little time.

Monday, December 27, 2004

Onward, to a less nonsensical visualisation!

SET012-clip2.png

The new, matrix inspired, visualisation method is coming along nicely and, as can be seen in the picture linked above, allows one to identify possibly interesting features of the dataset. In this case, there are three points early in the graph that are well above the line formed by the given parent edges. These three, and a number of strange plateaus in the line are probably cause by factorisation.

Factorisation is a technique used by the prover to 'modify' clauses and is done (in these cases at least) when the clauses are added to the "database" of clauses. That is, when they are created. This leads to points that ignore, almost completely, the rest of the graph.

The first thing on the agenda for the coming week is to finish and polish the visualisation itself. Dealing with the gap a the start of the graph caused by the clauses of the theorem not having any parents will be near the top of the list, as will testing the layout code to ensure that it is not piling nodes on top of each other (I believe that it is, though it didn't seem to be when I had a quick look I've had at this problem). When the drawing code is complete and correct, I'll move on to adding in some useful features like drawing the axes, labeling rows and columns, etc.

After the essential features are implemented, the visualisation will need to be evaluated with respect to the goal of my project. If it doesn't seem to make the search process any more obvious, I'll have to reconsider my approach to the problem. On the other hand, if this approach is useful, I'll start to add some more exotic abilities to the application focusing on the dynamic exploration of the dataset. Animation, dynamic colouring of graph elements and querying and constraining the dataset will all, I hope, help the user explore and understand the data set.

Another avenue that will need to be explored is the sheer size of the visualisation. The SET012 graph has the same [very approximately] dimensions as twenty seven and a half, 17 inch monitors side-by-side. Needless to say, this will be difficult to navigate, especially on machines with a single, or smaller, screens. Fortunately there is a lot of research the area of zoomable interfaces, and a number of toolkits like Piccolo and ZVTM should help implement some form of scaling.

On a more meta-blog note: For some reasoning the Blogger spell check feature insists on using an American dictionary and, therefore, trying to correct words that are perfectly fine in the rest of the English speaking world. If I happen to accept some bizarre substitution it suggests, please forgive me.

Wednesday, December 22, 2004

Java Programming Language

Java Programming Language, JDK 5

I've was just digging through the Java API docs and found my way to the description of the latest language enhancements. The changes to the language in JDK 5 (nee 1.5) are seriously cool. The descriptions of the new features and constructs read like a who's-who of modern language features:

typed-checked generics
a useful for loop
static imports
useful enums
annotations.

Java is growing up. The for-each style loop and typed generics, as much as I dislike the <> syntax, will significantly improve the experience of programming in Java and the comprehensibility of the resultant code. Static imports, annotations and enums are, in my opinion, more open to abuse, but with discipline they too will make Java a much more attractive language to code in.

I can't help but wonder how long it will take these features to filter into programming and software engineering courses.

Tuesday, December 21, 2004

Alternative Relations on Graphs

This project has, so far, focused on visualising graphs of only one relation on clauses, namely the parent-child relation. Whilst the addition of temporality adds an extra dimension to such a view, it would be interesting to expand the number of relations.

Another point which might bear investigation is the degree of temporal locality in the graphs growth. A graph with iterations (or whichever time quanta is appropriate) as nodes and the relation between the nodes of one quanta and another being an edge between them. The degree of the relation might be reflected in the weight of the edge. This could be visualised with a time quanta being positioned in a ring layout, with the temporal relation forming the ring, and the "share-a-relation" relation being the internal links.

Monday, December 20, 2004

Automated Deduction Image Gallery

http://rsise.anu.edu.au/~sutton/matrix/

I've started on a layout algorithm based on the concept of adjacency matrices. At the moment, it's a full matrix, but as soon as I can think of an elegant way to code it, I'll be tearing out as much of the white space as I can. In essence, the new layout will be a visual parallel of the adjacency lists data structure for storing sparse graphs.

We'll see how I go, but I think that this approach might be more general than a radial layout algorithm. Having animations of a few minimal spanning tree or network flow algorithms would be quite cool.

Saturday, December 18, 2004

Alternative Visualizations and Pseudo-semantic Orderings

I've started thinking about alternative visualization methods in the last couple of days. The most promising alternative is an adjacency matrix tweaked slightly to handle the added temporal "dimension". An adjacency matrix will [I hope] make patterns easier to spot, will be easier to code, and will probably be a lot faster to run. The added speed will enable the use of a number of run-time effects to aid in exploration of the data sets (highlighting related elements, selecting temporal "regions" of the data set, etc).

On the the other hand, the lecture yesterday by Thomas Meyer (which happened to be the final day of the Logic Summer School) was about belief revision and mentioned using epistemic entrenchment to impose an ordering on the set of clauses (or rather beliefs). The though struck me that this technique, or one similar, might be just what I need to impose some form of ordering on the set of clauses to be able to plot them effectively.

The only problem is that any such ordering would not be stable, so it would not allow comparison between two visualizations of different set of clauses (in the most obvious way at least). This might be applicable to both the radial network layout and an adjacency matrix. The radial layout might derive a clauses polar coordinates from its position in the ordering and from the time at which it was created. On the other hand, having a non-temporal ordering relation allow the adjacency matrix to be more useful in animating the graphs growth and development. I imagine that such a visualisation's appearance would be obvious to experts in automated reasoning, but it seems a lot more promising (and useful) than layout out the graphs to minimise edge crossings, etc.

Wednesday, December 15, 2004

A Radial Layout

All of my work to date [as is now reflected by back posts below] has been on two tasks:

generating runtime data dumps; and
experimenting with radial layout techniques.

Generating data has been relatively easy. The automated reasoning system (one based on SCOTT I believe) already had the capability to output most of the data I need (like some statistics and the generated clauses that are kept) so I simply added the capability to output clauses that are discarded. After the data has been generated, I have a whole heap of scripts in a number of languages to translate it into a variety of formats, from CSV-ish formats suitable for generating graphs of various statistics (using GNUplot) to a simple XML dialect which is then transformed into a number of others (for a couple of graph rendering packages) to dot (the language Graphviz uses to describe graphs).

On the radial layout front, I have been looking at papers describing a number of visualization toolkits and radial layout algorithms for graphs and trees. I've modified a layout component from one of these toolkits (prefuse) to display my data in a radial layout, where radial distance is proportional to time. Unfortunately, this leads to very messy graphs [123K].

I'm going to try and simplify the dataset and see if that will help resolve the problem. If not, I think we'll have to find another visualization for the data.

Summer Project

I remembered today that I had a Blogger account and thought that I should probably use it for something. What better purpose than to document my summer research project at the Research School of Information Sciences and Engineering of the Australian National University. This blog then will focus on my attempts to find a visual metaphor to help [humans] understand the search an automated reasoning system performs whilst attempting to prove a theorem.

Essentially this boils down to displaying very large graphs and the way they change over time. Some examples of preliminary work can be found at my project web page.

[Note: Posts previous to this one are based on work notes and have been back dated.]