SET012-clip2.png
The new, matrix-inspired visualisation method is coming along nicely and, as can be seen in the picture linked above, allows one to identify possibly interesting features of the dataset. In this case, there are three points early in the graph that are well above the line formed by the given parent edges. These three, and a number of strange plateaus in the line, are probably caused by factorisation.
Factorisation is a technique used by the prover to 'modify' clauses and is applied (in these cases at least) when the clauses are added to the "database" of clauses. That is, when they are created. This leads to points that almost completely ignore the rest of the graph.
The first thing on the agenda for the coming week is to finish and polish the visualisation itself. Dealing with the gap at the start of the graph, caused by the clauses of the theorem not having any parents, will be near the top of the list, as will testing the layout code to ensure that it is not piling nodes on top of each other (I believe that it is, though it didn't seem to be when I had a quick look at the problem). When the drawing code is complete and correct, I'll move on to adding some useful features like drawing the axes, labelling rows and columns, etc.
After the essential features are implemented, the visualisation will need to be evaluated with respect to the goal of my project. If it doesn't seem to make the search process any more obvious, I'll have to reconsider my approach to the problem. On the other hand, if this approach is useful, I'll start to add some more exotic abilities to the application, focusing on the dynamic exploration of the dataset. Animation, dynamic colouring of graph elements, and querying and constraining of the dataset will all, I hope, help the user explore and understand the data.
Another avenue that will need to be explored is the sheer size of the visualisation. The SET012 graph has [very approximately] the same dimensions as twenty-seven and a half 17-inch monitors side by side. Needless to say, this will be difficult to navigate, especially on machines with a single, or smaller, screen. Fortunately there is a lot of research in the area of zoomable interfaces, and a number of toolkits like Piccolo and ZVTM should help implement some form of scaling.
On a more meta-blog note: for some reason the Blogger spell check feature insists on using an American dictionary and, therefore, on trying to correct words that are perfectly fine in the rest of the English-speaking world. If I happen to accept some bizarre substitution it suggests, please forgive me.
Monday, December 27, 2004
Wednesday, December 22, 2004
Java Programming Language
Java Programming Language, JDK 5
I was just digging through the Java API docs and found my way to the description of the latest language enhancements. The changes to the language in JDK 5 (née 1.5) are seriously cool. The descriptions of the new features and constructs read like a who's who of modern language features:
- type-checked generics;
- a useful for loop;
- static imports;
- useful enums; and
- annotations.
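For the record, here is a contrived snippet exercising most of these at once; the Severity enum and everything around it is my own invention, purely illustrative:

```java
import static java.lang.Math.max;  // a static import

import java.util.ArrayList;
import java.util.List;

public class Jdk5Demo {
    // a useful enum: constants can carry fields and methods
    enum Severity {
        LOW(1), HIGH(10);
        private final int weight;
        Severity(int w) { weight = w; }
        int weight() { return weight; }
    }

    @Deprecated  // an annotation, understood by the compiler
    static int worst(List<Severity> events) {  // type-checked generics
        int worst = 0;
        for (Severity s : events)              // the new for loop
            worst = max(worst, s.weight());    // max() via the static import
        return worst;
    }

    public static void main(String[] args) {
        List<Severity> events = new ArrayList<Severity>();
        events.add(Severity.HIGH);
        events.add(Severity.LOW);
        System.out.println(worst(events));  // prints 10
    }
}
```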
I can't help but wonder how long it will take these features to filter into programming and software engineering courses.
Tuesday, December 21, 2004
Alternative Relations on Graphs
This project has, so far, focused on visualising graphs of only one relation on clauses, namely the parent-child relation. Whilst the addition of temporality adds an extra dimension to such a view, it would be interesting to expand the number of relations.
Another point which might bear investigation is the degree of temporal locality in the graph's growth. Consider a graph with iterations (or whichever time quantum is appropriate) as nodes, and an edge between two nodes whenever a relation holds between the clauses of one quantum and another. The degree of the relation might be reflected in the weight of the edge. This could be visualised by positioning the time quanta in a ring layout, with the temporal relation forming the ring and the "share-a-relation" relation forming the internal links.
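Roughly what I have in mind for the ring part is just even angular spacing of the quanta; a minimal sketch (all names and parameters here are made up):

```java
import java.awt.geom.Point2D;

public class RingLayout {
    /** Place quantum i of n on a ring of the given radius, centred at (cx, cy). */
    static Point2D.Double position(int i, int n, double cx, double cy, double radius) {
        double theta = 2 * Math.PI * i / n;  // even angular spacing around the ring
        return new Point2D.Double(cx + radius * Math.cos(theta),
                                  cy + radius * Math.sin(theta));
    }

    public static void main(String[] args) {
        // four quanta on a unit ring about the origin
        for (int i = 0; i < 4; i++)
            System.out.println(RingLayout.position(i, 4, 0, 0, 1));
    }
}
```

The internal "share-a-relation" links would then just be chords drawn between these positions, weighted by the degree of the relation.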
Monday, December 20, 2004
Automated Deduction Image Gallery
http://rsise.anu.edu.au/~sutton/matrix/
I've started on a layout algorithm based on the concept of adjacency matrices. At the moment, it's a full matrix, but as soon as I can think of an elegant way to code it, I'll be tearing out as much of the white space as I can. In essence, the new layout will be a visual parallel of the adjacency lists data structure for storing sparse graphs.
We'll see how I go, but I think that this approach might be more general than a radial layout algorithm. Having animations of a few minimum spanning tree or network flow algorithms would be quite cool.
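In code, the white-space-stripping step amounts to compacting each row of the matrix down to just its occupied columns, which is exactly the adjacency-list parallel. A rough sketch (not the real layout code):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class CompactMatrix {
    /**
     * Compact each row of a boolean adjacency matrix by keeping only the
     * columns that actually hold an edge: the visual analogue of storing
     * a sparse graph as adjacency lists.
     */
    static int[][] compact(boolean[][] adj) {
        int[][] rows = new int[adj.length][];
        for (int i = 0; i < adj.length; i++) {
            List<Integer> cols = new ArrayList<Integer>();
            for (int j = 0; j < adj[i].length; j++)
                if (adj[i][j]) cols.add(j);  // drop the white space
            rows[i] = new int[cols.size()];
            for (int k = 0; k < cols.size(); k++) rows[i][k] = cols.get(k);
        }
        return rows;
    }

    public static void main(String[] args) {
        boolean[][] adj = {
            {false, true, false, true},    // edges to columns 1 and 3
            {false, false, false, false},  // no edges at all
        };
        System.out.println(Arrays.deepToString(compact(adj)));  // [[1, 3], []]
    }
}
```

Drawing each compacted row left-justified, with the original column index recoverable from a label or colour, is the part that still needs an elegant formulation.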
Saturday, December 18, 2004
Alternative Visualizations and Pseudo-semantic Orderings
I've started thinking about alternative visualization methods in the last couple of days. The most promising alternative is an adjacency matrix tweaked slightly to handle the added temporal "dimension". An adjacency matrix will [I hope] make patterns easier to spot, will be easier to code, and will probably be a lot faster to run. The added speed will enable the use of a number of run-time effects to aid in exploration of the data sets (highlighting related elements, selecting temporal "regions" of the data set, etc).
On the other hand, the lecture yesterday by Thomas Meyer (which happened to fall on the final day of the Logic Summer School) was about belief revision and mentioned using epistemic entrenchment to impose an ordering on the set of clauses (or rather beliefs). The thought struck me that this technique, or one like it, might be just what I need to impose some form of ordering on the set of clauses so that they can be plotted effectively.
The only problem is that any such ordering would not be stable, so it would not allow comparison between visualizations of two different sets of clauses (in the most obvious way, at least). This might be applicable to both the radial network layout and an adjacency matrix. The radial layout might derive a clause's polar coordinates from its position in the ordering and from the time at which it was created. On the other hand, a non-temporal ordering relation would allow the adjacency matrix to be more useful in animating the graph's growth and development. I imagine that such a visualisation's appearance would be obvious to experts in automated reasoning, but it seems a lot more promising (and useful) than laying out the graphs to minimise edge crossings, etc.
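Concretely, the radial idea is just one formula per clause: the angle comes from the clause's rank in the (pseudo-semantic) ordering, the radius from its creation time, so related clauses share a bearing and contemporaneous clauses share a ring. A sketch, with all names and scales made up:

```java
import java.awt.geom.Point2D;

public class OrderedRadial {
    /**
     * Angle from the clause's rank in the ordering; radius from the
     * elapsed time (in ms) since the start of the run, scaled to pixels.
     */
    static Point2D.Double place(int rank, int nClauses, long createdMs,
                                long startMs, double pxPerSecond) {
        double theta = 2 * Math.PI * rank / nClauses;
        double r = (createdMs - startMs) / 1000.0 * pxPerSecond;
        return new Point2D.Double(r * Math.cos(theta), r * Math.sin(theta));
    }

    public static void main(String[] args) {
        // rank 0 of 8, created 2 seconds into the run, 10 px per second:
        // lands 20 px along the zero-angle axis
        System.out.println(OrderedRadial.place(0, 8, 2000, 0, 10));
    }
}
```

The instability problem shows up here directly: rank is relative to one particular set of clauses, so two runs give incomparable angles.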
Wednesday, December 15, 2004
A Radial Layout
All of my work to date [as is now reflected by back posts below] has been on two tasks:
- generating runtime data dumps; and
- experimenting with radial layout techniques.
On the radial layout front, I have been looking at papers describing a number of visualization toolkits and radial layout algorithms for graphs and trees. I've modified a layout component from one of these toolkits (prefuse) to display my data in a radial layout, where radial distance is proportional to time. Unfortunately, this leads to very messy graphs [123K].
I'm going to try and simplify the dataset and see if that will help resolve the problem. If not, I think we'll have to find another visualization for the data.
Summer Project
I remembered today that I had a Blogger account and thought that I should probably use it for something. What better purpose than to document my summer research project at the Research School of Information Sciences and Engineering of the Australian National University? This blog, then, will focus on my attempts to find a visual metaphor to help [humans] understand the search an automated reasoning system performs whilst attempting to prove a theorem.
Essentially this boils down to displaying very large graphs and the way they change over time. Some examples of preliminary work can be found at my project web page.
[Note: Posts previous to this one are based on work notes and have been back dated.]
Friday, November 26, 2004
Verbosity: Not always a bad thing!
It appears that we need to set 'very_verbose' to get the prover to output clauses generated by factorisation, etc. Without this, we get only 5 clauses from LCL403 instead of the nearly 30000 expected.
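For reference, in an Otter-style input file this sits alongside the print flags; a fragment, assuming the prover here accepts the usual set(...) syntax:

```
set(very_verbose).  % dump factorisation products, etc.
set(print_kept).    % see the later posts on kept/deleted clauses
```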
Playing with the dumps from SET012, it becomes apparent that using clustered subgraphs may be useful. When a large inter-rank gap is used with clustered subgraphs for iterations, the resultant plot is almost understandable. Further experiments to try include:
- creating edges between a given clause and its cluster of children (rather than a given clause and every child);
- collapsing nodes introduced by 'factor' into their parents; and
- correctly ordering the iteration clusters with respect to each other.
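In dot terms, the clustered-iterations setup looks something like this (clause and cluster names invented):

```dot
digraph proof {
  ranksep=3.0;  // the large inter-rank gap
  subgraph cluster_iter1 { label="iteration 1"; c1; c2; }
  subgraph cluster_iter2 { label="iteration 2"; c3; }
  c1 -> c3;  // parent edge crossing the iteration boundary
}
```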
Thursday, November 25, 2004
Data, Data, Everywhere
The amount of data produced by the problems may itself pose a problem. LCL005, for example, produces more than 600M of output and generates over 6,000,000 events.
Processing this data into a graph description is the next phase of the project. Some fairly trivial (and very naive) shell scripts have been created to split these files into a list of events for each iteration of the given clause loop. These are, however, excruciatingly slow and will need to be rewritten in something a little faster. A few more scripts were created to produce some event statistics and plot them using Gnuplot. These will also need to be re-written in a language that has decent text processing facilities (so as to remove the need to call three 'cat|grep's per file). Awk may be a good choice.
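The splitting itself should only need a single pass over the trace. A rough sketch in Java, assuming iterations are delimited by lines starting with "given clause #" (which needs checking against the real trace format):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SplitByIteration {
    /** One pass: start a new bucket each time a given-clause line appears. */
    static List<List<String>> split(List<String> lines) {
        List<List<String>> iterations = new ArrayList<List<String>>();
        iterations.add(new ArrayList<String>());  // events before the first given clause
        for (String line : lines) {
            if (line.startsWith("given clause #"))
                iterations.add(new ArrayList<String>());
            iterations.get(iterations.size() - 1).add(line);
        }
        return iterations;
    }

    public static void main(String[] args) {
        List<String> trace = Arrays.asList(
            "kept clause 1", "given clause #1: ...", "kept clause 2", "kept clause 3");
        System.out.println(SplitByIteration.split(trace).size());  // 2: preamble + iteration 1
    }
}
```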
I seem to be missing some proportion of KEPT clauses. A trace of LCL403 for instance does not list any KEPT clauses (except for those that form part of the theory) whereas the statistics say that 29208 clauses were kept. Most confusing.
Wednesday, November 24, 2004
Graphviz, Dot, Large Graphs and Crashing!
I came in this morning to discover that the Graphviz run on LCL005 consumed some nine and a quarter hours of real time and was killed, I can only assume, due to memory usage. I hacked up a simple C program to replace the shell script that converts "^clause" lines from runtime_data.dump to 'dot'. It removes duplicate parents, outputs them in ascending ID order, etc. It's also a lot quicker than the shell script.
One of the smaller problems (SET012) ran in a rather short amount of time, and generated a graph that renders to 19616x809px in PNG. It will probably be a good idea to keep working with this problem due to its smaller graph. GRP024 also runs much quicker than LCL005 (though it still takes quite a lot longer than SET012).
Approaches may be:
- Assign explicit ranks based on the iteration the clause was generated in then try to find a radial layout filter;
- Have a style=invis central node, with edges to every other node, where each edge is weighted inversely proportionally to the iteration in which the clause was generated (higher weights make for shorter, straighter edges).
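The second approach is easy to prototype by just emitting the dot by hand. A sketch, with the weight scale and clause IDs made up:

```java
public class RadialDot {
    /**
     * Emit one dot edge from an invisible hub to a clause node. The weight
     * is inversely proportional to the clause's iteration, so early clauses
     * are pulled hard towards the centre; style=invis keeps the hub edges
     * from being drawn while still influencing the layout.
     */
    static String hubEdge(int clauseId, int iteration) {
        int weight = Math.max(1, 100 / (iteration + 1));
        return String.format("  hub -- c%d [weight=%d, style=invis];", clauseId, weight);
    }

    public static void main(String[] args) {
        System.out.println("graph G {");
        System.out.println("  hub [style=invis];");
        System.out.println(RadialDot.hubEdge(7, 0));  // weight 100: hugs the centre
        System.out.println(RadialDot.hubEdge(8, 9));  // weight 10: further out
        System.out.println("}");
    }
}
```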
Another problem is that the prover does not appear to be able to provide enough information for my purposes. It does not appear to expose individual events (clause creation and deletion) or any information on clauses beyond the clause selected as "given" during each iteration.
Later:
I have added some code to the prover (and a flag to control it) that will output deleted clauses, though it does not (for reasons I have yet to determine) print a clause's ID. Once this problem is solved and the output is amended to include reasons for deletions, it should be possible to generate a data model for an animation, even if the animation itself is not yet possible.
Later still:
It looks like proc_gen() is breaking clauses before we get them. For example, the parents of a clause are printed (very_verbose), followed by a message "Subsumed by ?", and then our "** DELETED" message no longer has access to the clause's parents. The problem was caused by using a copy of the clause made before it was passed to proc_gen(), which, apparently, has the side effect of adding the parents list to the clause structure.
Even Later:
A trace of proving the theorem SET012 with the appropriate options enabled (print_kept, print_deleted, etc) is 2.3M in total, of which all but 32K is useful event data.
Tuesday, November 23, 2004
Graphviz, Dot and Large Graphs.
Today I started looking at transforming the data dumped by the theorem prover into 'dot', the language used by Graphviz (and a number of other packages). A quick and extremely dirty script translated the data into 'dot' and I, foolishly as it turns out, set a test run of LCL005 going to see just how big its graph is. When I went home for the day, it was still running.
I need to find an easy way to determine the iteration in which clauses are generated so that I can play silly buggers with weighted edges, explicit rankings, etc.