3-way merge of PowerPoint with git

This doesn’t work for some reason, but I’ll make a note and look at it later. On the Mac’s Script Editor, the dictionary for MS PowerPoint 2011 says it has a method called mergeWithBaseline. That sounds like a 3-way merge. This is a sketch of what the code should look like. For some reason I can’t get mergeWithBaseline to do anything.

#!bash
# pptmerge LOCAL REMOTE BASE MERGED
osascript -l JavaScript << EOF
PPT = Application("Microsoft PowerPoint");
PPT.activate()
local = Path($1)
PPT.open(local)
PPT.activeWindow.presentation.mergeWithBaseline({withRevisionPath: $2, withBaselinePath: $3})
PPT.activeWindow.presentation.endReview()
PPT.save({in: $4, as: "save as presentation"})
PPT.quit()
EOF

In the .gitconfig there should be something like this:

[mergetool "powerpoint"]
    cmd = pptmerge \"$LOCAL\" \"$REMOTE\" \"$BASE\" \"$MERGED\"

It looks like you can select different merge tools with “git mergetool –tool=powerpoint fileName”. Is there a way to trigger this by file extensions (.pptx)?

Build an interpeter in .NET

A long while ago I noted one could build an efficient language interpreter if the runtime could inline operations more aggressively. .NET has long had an attribute called MethodImplAttribute that gives the runtime some hints on how to compile a method. In .NET 4.5 they added one more hint: AggressiveInlining. This instructs the JIT to inline the method if possible. It took some doing, but I verified that it does inline methods most of the time. This post talks more about it. Plus, at the bottom is a comment from (probably) a .NET dev that says, “It simply turns off all our heuristics, and causes us to inline the method if the JIT is capable of inlining the method. The reason the documentation is so vague is that there are limitations to what the JIT can actually inline…”. So it works, but it might not sometimes.

To determine if inlining is working you can scan the ETW output from the JIT as described here. I did the following:

// Run wevtutil only once on your machine. It should work fine forever after. :)
wevtutil im C:\Windows\Microsoft.NET\Framework\v4.0.30319\CLR-ETW.man
// Do the following each time.
logman start clrevents -p {e13c0d23-ccbc-4e12-931b-d9cc2eee27e4} 0x1000 5 -ets
// RUN YOUR PROGRAM IN RELEASE MODE, NOT IN THE DEBUGGER
logman stop clrevents -ets
tracerpt clrevents.etl

Search in dumpfile.xml for MethodJitInliningSucceeded and MethodJitInliningFailed. They will give you a reason for what it chose to do. To make this a bit easier you might be able to use Reactive Extensions to monitor the logfile as described here.

The goal is to inline and optimize the interpreter directly into the program, thus creating a compiled program. This is the first Futamura projection. This is already pretty good, but to get the 2nd and 3rd projections you’d need to do the inlining at the bytecode level at compile time, not at runtime. That way you can save the optimized assembly and specialize again. This should exist somewhere. I couldn’t find anything comparable in the JVM. Perhaps LLVM? Maybe Rosyln?

Storm w/ Reactive Extensions and Dataflow

The Apache Storm project is a distributed dataflow framework. It’s used by Twitter to process a continuous stream of tweets through a network of machines. Microsoft’s TPL Dataflow library is similar, but only works on a single process. To get an approximation in .NET I want to convert MSMQ into a Spout.

One solution is to convert MSMQ into an Observable that pushes elements into a Dataflow block. I packaged this into the constructor for a QueueSourceBlock, which implements ISourceBlock. Here’s the code snippet:

public QueueSourceBlock(string queueAddress)
{
    var queue = new MessageQueue(queueAddress);
    queue.Formatter = new XmlMessageFormatter(new Type[]{typeof(T)});
    var tb = new TransformBlock<Message, T>(m => (T) m.Body);
    block = tb;
    var queueObserveOnce =
        Observable.Defer(
            () =>
                Observable.FromAsync(
                    () => Task<Message>.Factory.FromAsync(queue.BeginReceive(), queue.EndReceive)));

    queueObservable = Observable.While(() => true, queueObserveOnce);
    queueDispose = queueObservable.Subscribe(m => tb.Post(m));
}

The FromAsync method will receive only one message from the queue. The Defer method will generate a new FromAsync call on demand. The While method will keep calling the Defer forever. Whenever a message arrives from the Observable, it calls Post to push it into the TranformBlock. This block will extract the data and send it to the next node it’s linked to. This code doesn’t handle cancellation. AFAIK, there’s no way to cancel the BeginReceive on a queue, but I could support cancellation in other places.

Surprisingly there doesn’t appear to be a way to split a stream in .NET’s Dataflow. They’ve got a JoinBlock that merges streams, but not a SplitStream. I think if you increase the parallelism in a block it behaves sort of like “shuffleGrouping” in Storm. Still, it’s a weird oversight.

Simple Dropbox client w/ Reactive Extensions

Reactive Extensions is ideally suited to the task of monitoring a directory for changes. While the events from FileSystemWatcher are ok, it isn’t efficient. The goal is to send as little data as possible to the server. If a file is moved or copied, there’s no need to upload the file again. Instead, you should recognize the event on the client and simply send a Moved or Copied message pointing to the original file on the server.

Here’s a simple prototype. I left out all the code to track the files with hashes. It gets a little trickier trying to track directories. If a directory is deleted, the entire subtree is deleted. You can use SQLite with the closure.c extension to track hierarchical data.

Anyway, this proof-of-concept is easy. Cut and paste lost some formatting. Stupid tabs.

    public class DropboxClient
    {
        private readonly FileSystemWatcher watcher;

        public IObservable<DropboxEventArg> FileSystemObservable { get; private set; } 

        public DropboxClient(string home)
        {
            watcher = new FileSystemWatcher
            {
                Path = home,
                EnableRaisingEvents = true,
                IncludeSubdirectories = true
            };

            SetupRx();
        }

        private void SetupRx()
        {
            var changed = Observable.FromEventPattern<FileSystemEventArgs>(watcher, "Changed");
            var created = Observable.FromEventPattern<FileSystemEventArgs>(watcher, "Created");
            var deleted = Observable.FromEventPattern<FileSystemEventArgs>(watcher, "Deleted");
            var renamed = Observable.FromEventPattern<FileSystemEventArgs>(watcher, "Renamed");

            // Often it repeats a change event for every little change (timestamp, file size, etc). 
            var dbchanged = changed
                .DistinctUntilChanged()
                //.Do(UpdateRecord)
                .Select(fe => new DropboxEventArg(fe.EventArgs));

            // This seems to work fine, I think
            var dbrenamed = renamed.Select(fe => new DropboxEventArg(fe.EventArgs));

            // Deleted is ok, too
            var dbdeleted = deleted
                //.Do(DeleteRecord)
                .Select(fe => new DropboxEventArg(fe.EventArgs));

            // If file already exists, then a created file is a copy of another file
            var dbcreated = created
                .Select(fe =>
                {
                    if (FileExists(fe.EventArgs.FullPath))
                        return new DropboxEventArg(fe.EventArgs, DropboxChangeTypes.Copied);
                    else
                    {
                        //CreateRecord(fe);
                        return new DropboxEventArg(fe.EventArgs);
                    }
                });

            FileSystemObservable = dbchanged.Merge(dbrenamed).Merge(dbdeleted).Merge(dbcreated);
        }

        private void CreateRecord(EventPattern<FileSystemEventArgs> fe) {
            // Create row in repo
            throw new NotImplementedException();
        }

        private void UpdateRecord(EventPattern<FileSystemEventArgs> obj)
        {
            // If file size is different, rehash and update
            // If dir, maybe do nothing. Not sure.
            throw new NotImplementedException();
        }

        private void DeleteRecord(EventPattern<FileSystemEventArgs> obj) {
            // Delete file from repository
            // If directory, delete entire subtree from repo
            throw new NotImplementedException();
        }

        private bool FileExists(string fpath)
        {
            // If file, hash and lookup 
            // If dir, maybe do nothing
            return false;
        }
    }

    public enum DropboxChangeTypes
    {
	Changed,
	Created,
	Deleted,
	Renamed,
	Moved,
	Copied
    }

    public class DropboxEventArg {
	public DropboxChangeTypes ChangeType;
	public string FullPath;
	public string Name;

	public DropboxEventArg()
	{
	}

	public DropboxEventArg(FileSystemEventArgs fe)
	{
		FullPath = fe.FullPath;
		Name = fe.Name;
		switch (fe.ChangeType)
		{
			case WatcherChangeTypes.Changed: 
				ChangeType = DropboxChangeTypes.Changed; break;
			case WatcherChangeTypes.Created:
				ChangeType = DropboxChangeTypes.Created; break;
			case WatcherChangeTypes.Deleted:
				ChangeType = DropboxChangeTypes.Deleted; break;
			case WatcherChangeTypes.Renamed:
				ChangeType = DropboxChangeTypes.Renamed; break;
		}
	}

	public DropboxEventArg(FileSystemEventArgs fe, DropboxChangeTypes ct)
	{
		FullPath = fe.FullPath;
		Name = fe.Name;
		ChangeType = ct;
	}
    }

Probabilistic Reactive Extensions: ProbablyDistinct

Reactive Extensions has an operator called Distinct. As data streams through, it filters out any items it has already seen before, allowing only unique items to pass to OnNext. The problem is Distinct stores all the unique items in a HashSet, which means it will consume a lot of memory if the number of unique items is large. The solution is to implement a ProbablyDistinct operator that uses a Bloom Filter to store unique items. A Bloom Filter is a very compact data structure that tests each item and replies “item is probably in the set” or “item is definitely not in the set”. In this case, there will be some false positives (it says it’s in the set, but it’s actually not). But it’s a tradeoff some applications might need, particulary long-running server apps that see lots of unique data items.

More broadly, it would be useful to have a few Rx operators that use probabilistic data structures. For example, the HyperLogLog can be used to estimate the count of large numbers of distinct items using very little storage. Another useful operator is the opposite of Distinct (Indistinct? SeenItBefore?). This is how most people use a Bloom Filter. Rather than do an expensive DB lookup, it first checks the filter to see if it is probably in the DB and then does the query.

Here’s the source for Distinct. All you have to do is replace the HashSet with a Bloom Filter (+ minor code tweaks).

        private static IObservable<TSource> Distinct_<TSource, TKey>(IObservable<TSource> source,
            Func<TSource, TKey> keySelector, IEqualityComparer<TKey> comparer)
        {
            return new AnonymousObservable<TSource>(observer =>
            {
                var hashSet = new HashSet<TKey>(comparer);
                return source.Subscribe(
                    x =>
                    {
                        var key = default(TKey);
                        var hasAdded = false;

                        try
                        {
                            key = keySelector(x);
                            hasAdded = hashSet.Add(key);
                        }
                        catch (Exception exception)
                        {
                            observer.OnError(exception);
                            return;
                        }

                        if (hasAdded)
                            observer.OnNext(x);
                    },
                    observer.OnError,
                    observer.OnCompleted
                    );
            });
        }

LINQ Unfold Operator

In Scheme’s SRFI-1 List library there’s a fold and unfold function. The fold is similar to the Aggregate operator in LINQ. However, there doesn’t appear to be anything like an unfold in LINQ. What is it exactly? The Aggregate operator takes an IEnumerable and applies a Func to the elements until you get a single result value. The unfold operation takes a single value and returns an IEnumerable. I needed this because I wanted an easy way to get all the parent classes and interfaces for a type T.

/// seeds: the initial data to unfold
/// stop: if stop(seed) is True, don't go any further
/// map: transform the seed into the final data
/// next: generate the next seed value from the current seed 
public static IEnumerable<R> UnFold<T,R>(this IEnumerable<T> seeds, Predicate<T> stop, 
                                         Func<T,R> map, Func<T,IEnumerable<T>> next) {
	foreach (var seed in seeds) {
		if (!stop(seed)) {
			yield return map(seed);
			foreach (var val in next(seed).UnFold(stop, map, next))
				yield return val; 
		}
	}
}

I’m not too happy about this code, but it should probably work. Now to produce all the parents of a type T:

var parents = new[]{someType}.UnFold(t => t == null, t => t, 
                                     t => t.GetInterfaces().Concat(new[]{t.BaseType}))
                             .Distinct();

For type theory nerds this is called an anamorphism, which is the dual of a catamorphism. There’s a way to generalize this to arbitrary recursive data structures, described in the paper “A Fold For All Seasons” (PDF link). I’ll leave it to someone else to generalize this beyond IEnumerable.

Rerunning an old benchmark

In 2011, Robert Hundt published this paper testing the performance of several languages on a loop detection task typically found in compilers. The code is available here. Go had to tweaked a bit. I ran the benchmarks on EC2 m3.medium instance with Ubuntu 14.04 (ami-1d8c9574). For some benchmarks there’s a Pro version where someone hand optimized the code. Here are the results: 

Language Runtime(sec)
C++ 48
Java 93
Java Pro 40
Scala 74
Scala Pro 28
Go 92
Go Pro 63
Python 891
PyPy 138

 

The following lessons can be learned: PyPy is awesome. Java, Scala, and Go are in the same league. Finding an expert to optimize your hotspots might double performance.

Versions:

  • g++ (Ubuntu 4.8.2-19ubuntu1) 4.8.2
  • java version “1.8.0_05″
  • go version go1.3 linux/amd64
  • Python 2.7.6
  • Python 2.7.6 (2.3.1+dfsg-1~ppa1, Jun 20 2014, 09:27:47) [PyPy 2.3.1 with GCC 4.6.3]
  • Scala code runner version 2.11.1 — Copyright 2002-2013, LAMP/EPFL