Can doing things right bite?
We released our first product, AppLife Update, in late June. This product is an automatic updating solution for .Net that makes it really easy to add automatic updating features to any .Net application. Of its many features, one is that you can apply an update without needing administrative rights. We accomplish this through the use of a Windows Service that can execute an AppLife update package under the security context of the local system. To provide this feature securely, we did a lot of work. We require administrative permissions to register any given application before it can use the service. We use RSA public/private key technology, we lock files, we double-validate both inside and outside the service, we do lots of things. Somewhere in all of that security decision making, we decided to sign all of our executables with an X.509 Authenticode certificate. Seems pretty benign, right? Heck, if you read the Windows Vista Software Logo Spec 1.1, you'll see that Microsoft thinks it's a good idea:
All executable files must be signed with an Authenticode certificate. This includes files with the following extensions: exe, dll, ocx, sys, cpl, drv, scr
Limited waivers may be available for 3rd party redistributables when signed versions are not available.
Before going out and buying our certificate, I read lots of information about Authenticode signing and it just seemed like the right thing to do. Heck, the only real decision I felt I had to make was whether to pay more for a VeriSign certificate or go with Thawte. We didn't pay more. (I don't really understand the difference, especially when VeriSign owns Thawte, but that's a topic for another day. Perhaps the power of a brand?) Signing our assemblies just seemed like the right thing to do. So we did. Carte blanche. All exe files were code signed, along with MSIs and MSMs.
And when we were all done, we tested, tested and tested some more. We covered just about every potential environment we could muster in-house, and we found beta testers out there "in the real world" to run updates. We were updating in corporate environments, in small business environments, in home environments. Our solution just worked. So we shipped.
So what's the problem?
Along about the end of July, we received a support email informing us that a guy in Germany couldn't install our software. He had 64-bit hardware running German 64-bit Vista. Not knowing what his issue was, we went to work trying to solve it. Whenever he tried to install our software, our Windows service would fail to start. Looking into the MSI logs, we found:
Error 1920. Service 'AppLife Update Service' (KjsUpdateService) failed to start. Verify that you have sufficient privileges to start system services.
We tried hard to reproduce his issue and we couldn't. We did find some issues with our product and 64-bit operating systems, and our product is now better for that, but no matter what we tried, we just were not able to reproduce this. I also got really good at navigating to the control panel and event logs on non-English Vista. This might come in handy some day, if more than a handful of customers ever actually make the move to Vista. This person then discovered on his own that if he disabled his network card, the service could start. Inquiring about this, we learned that his company used a Ken firewall system that is common in Germany, but does not have an English version. After spending loads of time on this and seeing the evaluator's enthusiasm to get to the bottom of the issue disappear (and who can blame him), we gave up. Our best guess at the time was that some incompatibility with their firewall/proxy system was causing our service to fail. We really couldn't do much more because, after all, we kinda needed him since we were never able to reproduce it.
Then in September, another evaluator reported a similar issue. He too was in Germany, and while he didn't have any unfamiliar networking software, his company was behind a proxy server. It was then that we first knew that we had a real problem and, good or bad, we were happy to have another "tester" since we knew we hadn't fixed it and still couldn't reproduce it. So we littered our service code with additional exception handling and sent him a custom version of the service. He still couldn't install and we weren't getting any information out of our exception handling. This was very odd. It just seemed like either our code was not executing or something was hanging. We didn't have to look through much code since the service really does very little work on startup. One thing it does do is create an IPC Remoting channel. The IPC Remoting channel is based on named pipes. So we went into dissecting every bit of the framework IPC Remoting channel, looking for anything that might fail when executing under the local system account, or in a way that hangs the process. This led us nowhere, except to a better understanding of the framework IPC channels.
This story ends with the final realization that our code was likely never even being entered and with the proclamation that Google really is magical. I really don't remember how, but I found this blog post using Google.
And in there it talks about Certificate Revocation Lists, how the .Net framework attempts to verify a certificate, and how this can be time-intensive. BINGO! And I wasn't even looking for anything related to code signing.
The revelation was found.
Loading a code-signed .Net assembly can take a long time if network access is blocked for the security profile that is loading the assembly.
We rebuilt the service without signing the service executable and sent it off to our evaluator. Problem solved. After this, I went back and re-read about code signing, and the potential for this issue is just not presented anywhere in the educational literature that I could find. So file this little nugget away and think about it any time there is an inexplicable pause in loading a .Net assembly. Is it signed?
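For what it's worth, there is also a way to keep the signature and still avoid the stall, assuming you don't need Publisher evidence for code access security. Later servicing releases of the .Net Framework 2.0 runtime recognize a configuration element that tells the loader to skip generating Publisher evidence, which is the step that triggers the online certificate revocation check. A minimal app.config sketch (placed next to the service executable) would look something like this:

```
<?xml version="1.0" encoding="utf-8"?>
<configuration>
  <runtime>
    <!-- Skip Authenticode Publisher evidence generation at assembly load.
         This avoids the network round trip to fetch the Certificate
         Revocation List, which is what hangs when network access is
         blocked for the account loading the assembly. -->
    <generatePublisherEvidence enabled="false"/>
  </runtime>
</configuration>
```

We didn't go this route at the time, but it's worth knowing about if stripping the signature isn't an option for you.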
Through all this, I did try to enlist the help of others, but when you can't reproduce an issue, it's really hard to ask anybody to give the time of day to solving it.
And even after discovering this, we made boneheaded mistakes that led to further issues by just not thinking. This week another evaluator, amazingly again in Germany, couldn't perform an elevated update using the service. It turns out that we had missed removing the code signing step on assemblies that the service interacts with during an update process. I felt pretty stupid this week.
What's even more surprising is what the Framework does after the certificate validation fails. The assembly fails to get a specific attribute, but otherwise loads and executes just fine. In both of our problems, the actual reason for failure was timeouts waiting for activity that should not have taken so long. If our timeout thresholds had been longer, the process would have worked, but it would have been unnecessarily delayed.
So finally, after almost six months since the first report of an issue, I think we have this problem completely solved.
We still code sign our primary executable and our installers, but that's it. We do this primarily because we distribute our software exclusively over the internet and we want to instill confidence in our products. Also, the worst that can happen now is that some users might experience a delay before our application starts.
The lesson learned here is a tough one. We decided to sign our assemblies based on a desire to follow Microsoft suggestions for trustworthy computing and doing the right thing. I now know that doing the right thing can bite. Hard.