Week 1

This document summarizes work done/being done in week 1.

1 'BFS' Style Traversal of Object Graph

It turns out that finding paths in a graph based on a regular expression is not an unusual problem. Other places where it is used includes the XML tree (using the XPath query language), biomedical networks, etc.

Such a query is called a Regular Path Query on a graph. I read this paper, which describes a BFS-way to traverse the graph/automaton. The problem they are solving is not exactly the same as mine (they are going for finding a path between any two nodes using a RPQ and using something which they call rare labels to speed it up), but I was inspired by their basic idea of traversal.

1.1 Naive Idea

The pseudocode for a naive traversal is given below, but here is some preliminary info:

  • G is the graph.
  • NFA is the nondeterministic finite automaton.
  • NFA[i] represents the ith state of the NFA.
  • NFA[i].followList[label] gives a (possibly empty) list of states NFA[k] such that (NFA[i], NFA[k]) are joined by the specified label.
  • Path(head, prevPath) is a recursive data structure.
  • Q is a queue containing 3-tuples of the form (nfaState, graphNode, Path).
  • Visited is a set containing 2-tuples of the form (nfaState, graphNode).
  • LivePaths is a set containing Path.
def BFSPathFind(NFA, G):
  Q.push((NFA[0], G[0], Path(G[0], None)))
  while (NFA[i], G[j], P) = Q.deq():
    Visited.add((NFA[i], G[j]))
    for (G[j], G[y], label) in G[j].edges:
      for NFA[z] in NFA[i].followList[label]:
        if Visited.contains((NFA[z], G[y])): continue
        Q.enq((NFA[z], G[y], Path(G[j], P)))
    if NFA[i].isSuccessState:
      LivePaths.add(P)
  return LivePaths

In an illustration, here's how the iterations look like for the graph below shown:

fig1.png

with a regular expression of the sort (x.(l|r)*.cross) + (x.l*.r), which gives us the automaton:

fig2.png

The iterations look like so (the * represents a success path, of which all nodes may be live):

Iteration1 Iteration 2 Iteration 3 Iteration 4
g0 g0 -> g1 g0 -> g1 -> g3 g0 -> g1 -> g3 -> g7
  g0 -> g2* g0 -> g1 -> g4* g0 -> g1 -> g3 -> g8*
    g0 -> g2 -> g4* g0 -> g1 -> g4 -> g9
    g0 -> g2 -> g5 g0 -> g1 -> g4 -> g10
    g0 -> g2 -> g6 g0 -> g2 -> g5 -> g11
      g0 -> g2 -> g5 -> g12
      g0 -> g2 -> g6 -> g13
      g0 -> g2 -> g6 -> g14

While this looks alright at a glance, there's a mistake in this sort of iteration. Consider the following scenario, where (gN, sM) represents a state of the graph and NFA respectively.

Suppose we have a state (g1, s1) pushed to the end of the queue at the end of processing for the state (g0, s0). At this point, (g0, s0) is marked as visited. Now, another state (g2,s2) which has a valid transition to (g0, s0) makes it to the head of the queue - but that's already visited, so it will never go down that path, and thus we will not mark g2 live even if it is live further down the path.

The problem boils down to this: while a search is ongoing for a path containing some (g, s), any other state which can transition to this will be wrongly marked dead.

The solution for this is basically that we somehow "wait" on these ongoing operations, and depending on their result, decide liveness.

1.2 'Ongoing' Pseudocode

The idea is to split the whole thing into two phases: a normal BFS (in which we keep track of the paths as shown above), but whenever we encounter an ongoing state, we simply save that path for later checking.

Another way to do it is to add the state which is requesting the ongoing state to the end of the queue (and we may need to do it multiple times), until the ongoing state is free (we'll have to manually mark it free).

If we have, for instance, Path -> (g, s) stored at the end, we simply need to check whether (g, s) is part of a live path or not. In case it is, we can conclude that the entire Path is live, on the other hand, if it is not live, the entire path isn't.

The pseudocode for a modified traversal is given below, notations added/changed from above code are given below;

  • Path(head, nfaState, prevPath) is a recursive data structure.
  • Ongoing is a set containing 2-tuples of the form (nfaState, graphNode).
  • Live is a set containing 2-tuples of the form (nfaState, graphNode).
  • ConservativePath is a set containing paths.

The BFSPathFind should be called for the main BFS, followed by BFSPathCollapse when we finally want to collapse all the saved paths to a list of graph nodes.

def UnwrapPath(P):
  while p != None:
    Ongoing.remove(p.nfaState, p.head)
    Live.add(p.nfaState, p.head)
    p = p.prevPath

def BFSPathFind(G, NFA):
  Q.enq(NFA[0], G[0], None)
  while (NFA[i], G[j], P) = Q.deq():
    if Live.contains(NFA[i], G[j]):
      UnwrapPath(Path(G[j], NFA[i], P))
      continue
    if Ongoing.contains(NFA[i], G[j]):
      ConservativePath.add(Path(G[j], NFA[i], P))
      continue

    Ongoing.add(NFA[i], G[j])

    for (G[k], G[l], label) in G[j].edges;
      for NFA[m] in NFA[i].followList[label]:
        Q.enq(NFA[m], G[l], Path(G[j], NFA[i], P))

    if NFA[i].isSuccessState:
      UnwrapPath(Path(G[j], NFA[i], P))

def BFSPathCollapse():
  accumSet = Set()
  do:
    accumSet = Set();
      for p in ConservativePath:
        if Live.contains(p.nfaState, p.head):
          accumSet.add(p);
          Live.union(zip(p.allNFAStates, p.allHeads))
       ConservativePath.setDifference(accumSet)
  while accumSet.size() != 0
  return Live.Map((nfaState, graphNode) -> graphNode)

2 Comparing against 'Mark'

This code should be run after a single mark phase (which will be used to build the memory graph). The total running time (mark + this code) is several orders of magnitude slower than just a single mark (which is what will run in a reachability based GC).

It's unfeasible to use this in this condition, so I need to look at some speedups.

3 Speedups

Here are some potential speedups I plan to add/have added. The first few are mostly from the paper above, while the last one was discussed earlier.

  1. NFA follow lists: instead of looping through all possible NFA transitions for a given state and then checking the label to make sure it is the same as the edge label, I keep a mapping of (NFA, label) to transitions. This is already added.
  2. Two Way BFS: Using a 2-way BFS, time can be improved further. Here are some rough pre-reqs to do it:
    1. Two-way NFAs (constructed during compile time) and Graph (constructed during the first mark phase).
    2. Marking the success states of the NFA (during the first mark phase). This could possibly be helped by minimizing the NFA as much as possible during the compile phase.
  3. Using rare labels for finding paths (I need to think about this a bit more since our paths always start at the root variables, and further, we might have to calculate which labels are rare at runtime).
  4. Not traversing the entire tree, instead, recording the memory used by each subtree under a node within the node. Then, after a number of steps, we decide whether or not to descend into the subtree depending on whether it is worth the trouble to collect within it.

4 Other Notes

  • I'm currently using my own implementation of NFAs, but I could take a look at digitalheir/java-nfa since it seems to give me the same sort of functionality. It seems that it also pre-computes transitions for a label (the follow lists) that I do.

Author: Milind Luthra

Emacs 25.3.1 (Org mode 8.2.10)

Validate