Sunday, November 20, 2011

Silverlight, cross-domain issues, and self-signed certificates

1 comments
I've been meaning to post this for quite some time now, as I haven't seen others with exactly the same issue. First, some context: a Silverlight application has special security measures in place to avoid cross-site request forgery (CSRF). By default, Silverlight only allows site-of-origin communication - for instance, "http://blog.sacaluta.com/test.aspx" will be able to access "http://blog.sacaluta.com/myservice.svc", but not "http://www.example.com". To allow more than site-of-origin communication, a service owner must have a clientaccesspolicy.xml file at the root of the service's domain configuring which domains are allowed to access that service. If you're interested, this is explained in greater detail on this MSDN site.
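
For reference, a minimal "allow everything" policy looks roughly like the sketch below. Treat it as an illustration only - you should restrict the allowed domains, headers, and paths to what your service actually needs:

<?xml version="1.0" encoding="utf-8"?>
<access-policy>
  <cross-domain-access>
    <policy>
      <!-- Allow calls from any domain; SOAPAction is needed for SOAP clients. -->
      <allow-from http-request-headers="SOAPAction">
        <domain uri="*"/>
      </allow-from>
      <!-- Grant access to every resource under the root. -->
      <grant-to>
        <resource path="/" include-subpaths="true"/>
      </grant-to>
    </policy>
  </cross-domain-access>
</access-policy>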

The issue I ran into is that I had a Silverlight application and a service, both running locally. My service had a proper clientaccesspolicy.xml configured to allow access from anywhere, and still my Silverlight application would fail with the message:

"An error occurred while trying to make a request to URI 'https://MYDOMAIN/MYSERVICE.svc'. This could be due to attempting to access a service in a cross-domain way without a proper cross-domain policy in place, or a policy that is unsuitable for SOAP services. You may need to contact the owner of the service to publish a cross-domain policy file and to ensure it allows SOAP-related HTTP headers to be sent. This error may also be caused by using internal types in the web service proxy without using the InternalsVisibleToAttribute attribute. Please see the inner exception for more details. ---> System.Security.SecurityException ---> System.Security.SecurityException: Security error..."

After debugging the issue further, I found that the problem was that my service had only a secure endpoint (SSL) and its certificate was self-signed (or did not match the domain - I can't remember now). In that case, my Silverlight application would not download the service's clientaccesspolicy.xml, and access was therefore denied. Since I was running code within a larger application that I did not control, I did not investigate further whether Silverlight can be configured to accept self-signed or mismatched certificates during development. (If you know whether this is possible, please let me know!)

How did I solve it? If you're running in Internet Explorer:

  1. Before loading your Silverlight application, first access the clientaccesspolicy.xml file directly. IE will warn about the self-signed or mismatched certificate, but you can choose to proceed.
  2. In the same tab, then access your Silverlight application. It will be able to access your clientaccesspolicy.xml at that point, and the call will go through. 

Simple trick, and effective. I'd love to know whether other browsers work the same way. By the way, this was tested with Internet Explorer 9.

ReadyNas, WebDav, and "Method Not Allowed"

0 comments
I have a ReadyNas Duo network-attached storage device, which I access only via WebDAV due to permission conflicts that appear if I use different protocols to write files to it. Given that Windows does not support WebDAV properly in my case, I installed a WebDAV client called BitKinex. I configured it to point to my share and guess what: an "HTTP: Method Not Allowed (/)" error.

The problem is that BitKinex, by default, points to the root of your server. The ReadyNas exposes different shares, and you must point to the right share to fix this. To do that, right-click on the WebDAV connection and select "Properties", go to "Site Map", and update "/" with your share name (in my case, "/documents"). Then it works fine.

Thursday, November 17, 2011

.NET: do not use System.Uri for domain validation

1 comments
Last time I wrote about System.Uri, it was about a bug that prevents trailing dots from being used in REST resources. Now the issue is different: how about relying on System.Uri for domain validation?

It's not uncommon to see System.Uri being used to validate an input that is supposed to be a domain name. I've seen code like this trying to validate domains:
public static bool IsDomainValid(string name)
{
    try
    {
        new Uri("http://" + name);
        return true;
    }
    catch (UriFormatException)
    {
        return false;
    }
}
Or, relying on the Host property in addition to UriFormatException, something like this:
public static bool IsDomainValid(string domainName)
{
    try
    {
        if (StringComparer.OrdinalIgnoreCase.Equals(new Uri("http://" + domainName).Host, domainName))
        {
            return true;
        }

        return false;
    }
    catch (UriFormatException)
    {
        return false;
    }
}
Preliminary tests with this code show that completely malformed domains (like 3@#@@.com) are rejected, so it seems to be great code. And the best part is that we don't have to write any domain validation ourselves.

Now, what about the following domains?

--------.com
-test.com
test-.com

They are all considered valid by System.Uri. However, according to RFC 1035 and RFC 1123, they are not. According to RFC 1035, not even a digit-only domain (like 999.com) is valid, but System.Uri is fine with all of them.

I played with some of the internal flags, and it seems that if you use E_HostNotCanonical (256), it starts rejecting some of these invalid domain names, but I really couldn't understand the rules it follows. And since there are different RFCs and different interpretations, it would be really hard for System.Uri to do a precise validation unless one passed in the RFC that the domain is expected to comply with.

At the end of the day, you're better off understanding the RFC you want to comply with and implementing the proper regular expression for that. In my case, I wanted it to be compatible with RFC 1123, so this is the regular expression I started with:

"^(?![0-9]+$)(?!-)[a-zA-Z0-9-]{1,63}(?<!-)$

And then I relaxed it to the following after concluding that digit-only domains are accepted by RFC 1123 (there are multiple interpretations, but I read the RFC and was convinced that they are fine):

"^(?!-)[a-zA-Z0-9-]{1,63}(?<!-)$"

This is the regular expression per domain label (the text between the dots). The relaxed version does not apply to the rightmost label, which must not be numeric, in order to differentiate a domain name from an IP address.

Also, this regular expression requires an explicit check that the entire domain name is no longer than 255 characters.
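
Putting it together, here is a minimal sketch of a validator built around that per-label expression. The class and method names are mine, and the 255-character and numeric-rightmost-label checks follow my reading of RFC 1123:

using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class DomainNameValidator
{
    // Per-label rule from above: 1-63 characters, letters/digits/hyphens,
    // not starting or ending with a hyphen.
    private static readonly Regex Label =
        new Regex(@"^(?!-)[a-zA-Z0-9-]{1,63}(?<!-)$", RegexOptions.Compiled);

    public static bool IsDomainValid(string domainName)
    {
        // The whole name must fit in 255 characters.
        if (string.IsNullOrEmpty(domainName) || domainName.Length > 255)
            return false;

        string[] labels = domainName.Split('.');

        // The rightmost label must not be numeric, so the name
        // cannot be confused with an IP address.
        if (labels[labels.Length - 1].All(char.IsDigit))
            return false;

        return labels.All(label => Label.IsMatch(label));
    }
}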

Wednesday, November 16, 2011

Regular expressions: backtracking can kill your performance

1 comments
Or why you should learn atomic grouping...

After last post on turning off useless backtracking by using atomic grouping, I kept on reading the Regular Expressions Cookbook and ran another experiment to validate the performance difference. Let's start with the results:

Normal Regex - # of loops: 1000, # of matches: 0, time (ms): 66149
Atomic Grouping - # of loops: 1000, # of matches: 0, time (ms): 11196

Note that the normal regex was about six times slower than the atomic-grouping version at failing to match.

The example from the book is a regex to match a well-formed html page. I saved a Wikipedia page, made a few adjustments to the program used in the last post (e.g. to read the html contents from file), and ran the tests.

The regular expression used is:
<html>(?>.*?<head>)(?>.*?<title>)(?>.*?</title>)(?>.*?</head>)(?>.*?<body[^>]*>)(?>.*?</body>).*?</html>
In every test I ran with this regex and with the version without atomic grouping, the normal regex was about six times slower. If you want to know more about atomic grouping versus a regular regex, please read my last post.
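
For reference, this is roughly how the measurement looked. The file name is made up and I don't remember the exact regex options I used, so treat this as a sketch rather than the exact program:

using System;
using System.Diagnostics;
using System.IO;
using System.Text.RegularExpressions;

class HtmlRegexBenchmark
{
    static void Main()
    {
        // Local copy of the Wikipedia page (hypothetical file name).
        string html = File.ReadAllText("page.html");

        // Same expression with and without the atomic groups.
        Regex normal = new Regex(
            @"<html>.*?<head>.*?<title>.*?</title>.*?</head>.*?<body[^>]*>.*?</body>.*?</html>");
        Regex atomic = new Regex(
            @"<html>(?>.*?<head>)(?>.*?<title>)(?>.*?</title>)(?>.*?</head>)(?>.*?<body[^>]*>)(?>.*?</body>).*?</html>");

        Measure("Normal Regex", normal, html, 1000);
        Measure("Atomic Grouping", atomic, html, 1000);
    }

    static void Measure(string name, Regex regex, string input, int numLoops)
    {
        int matches = 0;
        Stopwatch watch = Stopwatch.StartNew();
        for (int i = 0; i < numLoops; i++)
        {
            if (regex.IsMatch(input))
                matches++;
        }
        watch.Stop();
        Console.WriteLine("{0} - # of loops: {1}, # of matches: {2}, time (ms): {3}",
            name, numLoops, matches, watch.ElapsedMilliseconds);
    }
}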

And this difference was found without setting the regex to be compiled. After setting this flag, these are the values I get:

Normal Regex - # of loops: 1000, # of matches: 0, time (ms): 49319
Atomic Grouping - # of loops: 1000, # of matches: 0, time (ms): 9471

Still a pretty significant difference between the normal regex and atomic grouping.

Tuesday, November 15, 2011

Regular expressions: turning off useless backtracking

0 comments
Last time I mentioned the nice feature of making a quantifier greedy or lazy, which helps you match what you really want. This time, the topic is how to make your regular expression more efficient. First, recall from the last post how greedy and lazy quantifiers change the way backtracking works. Sometimes backtracking just doesn't make sense. Look at this example:
\b\d+\b
It is supposed to match integers at word boundaries (\b means a word boundary). At first it may be a bit hard to see, but backtracking is unnecessary here. For instance, try running this regular expression against an example string:
789abcdef654 123
At the point where the regular expression fails (when it checks whether "a" is a word boundary), it doesn't make sense to backtrack and check whether 9 is a word boundary (or 8, for that matter). We should just go ahead and move on to the next token.

This is where it pays off to know your tool well. Different flavors of regular expressions offer ways to avoid keeping backtracking positions, so that when a match fails, the engine just moves on. Java and .NET both support atomic grouping:
\b(?>\d+)\b
Atomic grouping is represented here by "(?>...)". When the regular expression engine leaves the group, all backtracking positions are discarded, so a failed match has no recourse other than moving on to the next characters to find further matches. It gives up on the current context.

In the example above, once the engine matches 789 it leaves the atomic group, so when \b fails to match there are no backtracking positions left to try. From this quick analysis, we can see that we avoid a lot of extra computation by avoiding this useless backtracking.

The remaining question is how much time we actually save. I wrote some test code to benchmark these two options and verify whether we are talking about substantial savings or not.
static void Main(string[] args)
{
    TestAtomicGrouping(1000);
    TestAtomicGrouping(10000);
    TestAtomicGrouping(100000);
}

static void TestAtomicGrouping(int numLoops)
{
    Regex regex1 = new Regex(@"\b\d+\b", RegexOptions.Compiled);
    Regex regex2 = new Regex(@"\b(?>\d+)\b", RegexOptions.Compiled);

    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < 100; i++)
        sb.Append('1');
    sb.Append('a');
    sb.Append(' ');
    for (int i = 0; i < 100; i++)
        sb.Append('1');

    string testString = sb.ToString();
    int firstMatchCount = 0;
    DateTime start = DateTime.Now;
    for (int i = 0; i < numLoops; i++)
    {
        if (regex1.IsMatch(testString))
            firstMatchCount++;
    }
    TimeSpan firstTest = DateTime.Now - start;

    start = DateTime.Now;
    int secondMatchCount = 0;
    for (int i = 0; i < numLoops; i++)
    {
        if (regex2.IsMatch(testString))
            secondMatchCount++;
    }
    TimeSpan secondTest = DateTime.Now - start;

    Console.WriteLine("Normal Regex - # of loops: {0}, # of matches: {1}, time (ms): {2}",
        numLoops, firstMatchCount, firstTest.TotalMilliseconds);
    Console.WriteLine("Atomic Grouping - # of loops: {0}, # of matches: {1}, time (ms): {2}",
        numLoops, secondMatchCount, secondTest.TotalMilliseconds);
}
Now let's see the results:

Normal Regex - # of loops: 1000, # of matches: 1000, time (ms): 48.0028
Atomic Grouping - # of loops: 1000, # of matches: 1000, time (ms): 30.0017

Normal Regex - # of loops: 10000, # of matches: 10000, time (ms): 386.0221
Atomic Grouping - # of loops: 10000, # of matches: 10000, time (ms): 239.0137

Normal Regex - # of loops: 100000, # of matches: 100000, time (ms): 3145.1799
Atomic Grouping - # of loops: 100000, # of matches: 100000, time (ms): 2079.1189

So, at the end of the day, getting rid of useless backtracking can be quite significant if you match a regular expression often. In this test, we saved 32% or more of the matching time just by "turning off" backtracking with atomic grouping.

Monday, November 14, 2011

Regular expressions: greedy vs. lazy quantifier

0 comments
I am reading the great book Regular Expressions Cookbook after running into a few things today that I did not know about regular expressions. I will get to the interesting regular expression I had to work on in a future post, but for now I will share something I found quite interesting: greedy and lazy quantifiers.

Let's start by trying to match a paragraph in HTML. A paragraph is typically surrounded by <p> and </p>, so I would write the regular expression as:

<p>.*</p>
This should take care of matching the paragraph, right? Only partially. If you have a long HTML document with multiple paragraphs, this will match from the first paragraph start (<p>) to the very last paragraph end (</p>). The "*" here behaves as what is called a greedy quantifier.

If you want it to behave differently, you will want to use what is called a lazy quantifier. This is just the regular question mark placed after another quantifier. Note that if a question mark is placed after a regex token, it means "zero or one". That is not what we are talking about here - a question mark after a quantifier changes the quantifier's behavior.

<p>.*?</p>
In the example above, it matches only the first paragraph, not the entire text.
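
A quick way to see the difference is to run both expressions against a small two-paragraph snippet - a minimal sketch:

using System;
using System.Text.RegularExpressions;

class GreedyVsLazy
{
    static void Main()
    {
        string html = "<p>first</p><p>second</p>";

        // Greedy: .* runs to the end and backtracks to the last </p>.
        Console.WriteLine(Regex.Match(html, @"<p>.*</p>").Value);
        // Prints: <p>first</p><p>second</p>

        // Lazy: .*? stops at the first </p> it can.
        Console.WriteLine(Regex.Match(html, @"<p>.*?</p>").Value);
        // Prints: <p>first</p>
    }
}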

Under the covers, the regular expression engine uses backtracking to match the expression. A greedy quantifier eats up all the content it can match and then moves on to the next token. In the paragraph-matching example, .* reads the entire text to the very end. The engine then moves on to the next token (in this case <) - and since that fails because the document has ended, it backtracks and tries to match < again. It keeps backing up one character at a time until it matches.

With a lazy quantifier, the engine repeats as few times as it can before moving on to the next regex token (here <). If the token does not match, it backtracks, consumes one more character, and checks the token again.

I was happy to learn this, as I had always wondered how to control this behavior. It is also quite interesting to understand the regular expression engine's behavior, which can come in handy. Just be careful when using the question mark: as noted above, it can serve two purposes depending on where it's placed.

Sunday, November 13, 2011

Moving Channel 9 to Azure: good design principles

0 comments
Today I read a great article about Microsoft's Channel 9 moving to Azure, covering the sound design principles in place and the lessons the Channel 9 team shares about moving a web site to run in the cloud.

One of the things that caught my attention is seeing a Microsoft project use a distributed cache fleet running Memcache. Using a caching layer is definitely the right thing to do in many cases to make a site more scalable. I wonder why they haven't used Windows Azure AppFabric Caching. Also, after working on Amazon ElastiCache before joining Azure, I'd be curious how they monitor their Memcache instances.

I was very glad to see modular code, coding to interfaces, and, above all, dependency injection being used. While dependency injection is pretty popular in the Java world, it's still not as popular among Microsoft developers. They mention dependency injection being used "for testing purposes but also to isolate you from very specific platform details". Very well done.

Division of labor is the right principle for environments where machines are not reliable. This is the proper mindset about machines in the cloud: "In practice they tend to run a very long time, but you can't depend on that fact." Breaking tasks down, using worker roles to pick them up, and connecting them via queues seems a smart strategy (assuming you have proper monitoring of those queue depths in place). In particular, I like that the Channel 9 team did not simply assume that instances run for a long time and release an architecture based on that assumption, which would have left potential problems to be addressed later. Unfortunately I've seen a lot of people with that mindset, and Channel 9 did very well here.

From the article, though, the one thing that could have been done better is thinking about database sharding. Although SQL Azure will provide Federations, there are many things that service owners need to think about: what the database partition key will be, which queries will need to go across partitions and potentially hurt scalability, which queries will need to be federated, and so on. I am not very familiar with SQL Azure Federations and don't know whether it will automatically repartition hot partitions, but if it doesn't, that's another task service owners need to prepare for. With all that said, you don't need to shard right away, but you do need to think about it before version 1 of your service goes out - otherwise scaling can be a major headache, and if you can't afford downtime, it can become an almost impossible task in some cases.

All that said, I was very glad to read about their work and to see them share the architecture and lessons publicly.

Link to the InfoQ article:
http://www.infoq.com/articles/Channel-9-Azure

Saturday, November 12, 2011

Why high code coverage is not enough

0 comments
Managers typically like high "code coverage" and oftentimes think that it means the code quality is good. I agree that low code coverage definitely means one doesn't have enough unit tests, but high code coverage may not mean much either. It's necessary but not sufficient. To prove this, let's take a look at one example.

Once upon a time, I saw the following regular expression in production code. I will write it in C#, but the language or platform doesn't matter much.

public static bool IsValidIp(string ipAddress)
{
    return new Regex(@"^([0-2]?[0-5]?[0-5]\.){3}[0-2]?[0-5]?[0-5]$").IsMatch(ipAddress);
}
Let's say now that you have one unit test to make sure that your "boundary case" is accepted.

Assert.IsTrue(IsValidIp("255.255.255.255"));
Now you are happy, get the code checked in, and brag that you have 100% code coverage for the IsValidIp method. And so what? A simple "192.168.1.1" IP address is not considered valid. Completely buggy code, but 100% code coverage.

That is why managers who really understand what is being developed, and who take the time to look at the code, can make a total difference in the final product's quality.

Note: in the case above, it's amazing that the developer did not Google for a correct regular expression for IP validation, did not write data-driven unit tests to make sure different IPs are accepted, and that the code reviewers did not review it properly.
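
For completeness, here is a sketch of what such a data-driven test could look like, assuming the method lives in a class called IpValidator (a name I made up) and MSTest is in use, as the Assert.IsTrue above suggests:

using Microsoft.VisualStudio.TestTools.UnitTesting;

[TestClass]
public class IpValidatorTests
{
    // This single boundary case alone gives 100% code coverage...
    [TestMethod]
    public void AcceptsBoundaryAddress()
    {
        Assert.IsTrue(IpValidator.IsValidIp("255.255.255.255"));
    }

    // ...but checking a handful of ordinary addresses exposes the bug:
    // the regex above rejects 192.168.1.1 and 8.8.8.8.
    [TestMethod]
    public void AcceptsCommonAddresses()
    {
        foreach (string ip in new[] { "192.168.1.1", "10.0.0.1", "8.8.8.8" })
        {
            Assert.IsTrue(IpValidator.IsValidIp(ip), ip);
        }
    }
}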

On Zynga and its "give back stock or get fired" story

0 comments
This week the news has been all over the place about Zynga's CEO and executives demanding that employees either give back their not-yet-vested stock or face termination. One of the ways that startup companies lure employees into taking the risk is to offer equity - that's the currency startups use to compensate for the risk taken, the lower salaries, and the long hours. Once a company offers this equity, I think it must honor its contracts. At the same time, an episode like this can tell you a lot about the company, its values, and whether other people will want to join it in the future.

However, Zynga is correct in trying to be meritocratic. Those who contributed more should have a bigger piece than those who were around but did not contribute much to the company's success. A good compromise would be to have policies stating that stock grants depend on your performance evaluation. Some sort of multiplier would be applied: if you meet expectations, you get 1x your stock grant; if you're a rock star, you could get up to Nx (e.g. 2x); and if you're an underperformer, you may get nothing at all. That is much fairer than simply demanding stock back or threatening termination. Of course no system is entirely fair, as it can be subjective and politics always plays a part, but it's better than luring people into thinking that they will get their stock if they stick around long enough and then firing them primarily for that reason.

On the other hand, though, how many other companies may be contemplating, or actually carrying out, firing people with unvested stock to accomplish the same goal? At least one can say that Zynga was transparent about the reason it would fire its employees. But this is the kind of transparency one doesn't see very often, because it lowers all employees' morale and calls the company's moral values into question. A company that does the same, but not as openly, seems to be much better off, as employees tend to still believe that the company abides by its moral principles and that it's worth putting in all the effort to make the company grow.

Thursday, November 10, 2011

Coding guidelines and readability

0 comments
The more experience I get in the industry, the more I value great developers who know how to distinguish great code from good code. And I am glad that I had a fantastic experience reading and working with the Linux kernel - most of the code I saw at the time falls into the category of great code.

There are many aspects of a great developer one could focus on, but I'd like to focus here on code readability and maintainability. First, one quite important distinction, especially for those used to following strict "coding guidelines": code readability is not so much about where you place the brackets, or any style rule that can be verified by a static analysis tool. In my opinion, those details usually don't matter much.

Real readability is about the art of writing your code, not its science. Things like how to properly break your code into methods; how to name variables, classes, and methods; how to use whitespace properly; and how to make good use of fixed-width characters to align and indent code.

In my opinion, great engineers know that code must not just work - it must be a work of art. Something that you and others can read and maintain in the future. It is NOT just about getting it to work. A lot of the code I read works, but it is not readable, not elegant, and oftentimes not efficient at all.

Digressing a bit, I miss systems where you actually need to get the maximum performance out of the hardware. That seemed to require better engineers than nowadays. Now, with web services and cloud computing, oftentimes one doesn't care much about performance because you can always get a faster box to run your code. After working for two cloud providers, I see that this happens especially with those providing the cloud services. I wonder whether the engineers who just throw more CPU power or memory at a problem have ever thought about the cost of running a service, and that this cost is one of the factors that make the difference between being profitable or not. Or, on a more philosophical note, whether they really take pride in the engineering of the code they write.

But back to coding guidelines: I really recommend that you read the article below (published in ACM Queue) if you want to get better at this. It has a few tables that you should print out and put up on the wall where you can peek at them while writing your code.

Coding Guidelines: Finding the Art in the Science

Today I also came across a new O'Reilly book on the topic called "The Art of Readable Code", which seems quite interesting and probably delves into this subject in much greater detail.

The Art of Readable Code

URI segments, dots, REST, and .NET bug

0 comments
These days I learned about a bug in the System.Uri class that strips trailing dots from URI segments. See an example:

http://host/test.../abc

becomes

http://host/test/abc

That happens if either your client or your server is .NET. If your client believes you support the URL RFC correctly, it may send a request with trailing dots, and by the time it gets to your code, those dots are gone.
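
A minimal way to see the behavior, assuming you are on one of the affected .NET versions:

using System;

class TrailingDotDemo
{
    static void Main()
    {
        Uri uri = new Uri("http://host/addresses/home.");

        // On affected .NET versions this prints "/addresses/home":
        // the trailing dot of the segment has been silently removed.
        Console.WriteLine(uri.AbsolutePath);
    }
}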

The implication is that, if this URI segment is actually a resource name, you may be in trouble. Let me show you a concrete example:
  1. A resource is created by POSTing to the URL http://host/addresses. At this point, the resource name is passed in the payload, and your service will correctly accept trailing dots. For example, let's say we create an address named "home." (with a trailing dot). So far, so good.
  2. The user tries to perform a REST operation on this resource. It could be something as simple as a GET on http://host/addresses/home. (dot included).
  3. If you have a .NET client, the request will go out as http://host/addresses/home (no dot). Your server will then return the wrong data or an error (like 404 Not Found).
  4. If you have a non-.NET client, the request will go out correctly, but if your server is .NET-based you may still have an issue. For instance, a WCF REST service will parse this resource name as "home" (no dot), which will also return the wrong data or an error.
The consequence is that your .NET REST service should not allow dots - at least not trailing dots. However, allowing dots everywhere except at the end is not desirable, and quite possibly you will forbid dots altogether.

There's a workaround for this issue if you control both the client and the server code. However, if your customers generate their own client proxies, you must document what they need to do:
MethodInfo getSyntax = typeof(UriParser).GetMethod("GetSyntax", System.Reflection.BindingFlags.Static | System.Reflection.BindingFlags.NonPublic);
FieldInfo flagsField = typeof(UriParser).GetField("m_Flags", System.Reflection.BindingFlags.Instance | System.Reflection.BindingFlags.NonPublic);
if (getSyntax != null && flagsField != null)
{
    foreach (string scheme in new[] { "http", "https" })
    {
        UriParser parser = (UriParser)getSyntax.Invoke(null, new object[] { scheme });
        if (parser != null)
        {
            int flagsValue = (int)flagsField.GetValue(parser);
            // Clear the CanonicalizeAsFilePath attribute
            if ((flagsValue & 0x1000000) != 0)
                flagsField.SetValue(parser, flagsValue & ~0x1000000);
        }
    }
}
The code above clears a flag that tells the parser to canonicalize a URL as a file path. Yes, all URLs are treated as if they were Windows file locations.

Unfortunately, this bug has been known since 2008 but has never made it into a .NET release. It is marked as fixed, but as of .NET 4 we are still waiting for the fix to ship.

Here you can find more details about this issue:

Internet Explorer vs. Chrome: Recover Session

0 comments
Working at Microsoft, I use Internet Explorer, but when I want to keep some sessions separate, I run Chrome at the same time. I wanted to tell you about an experience I had last week.

Last weekend, after I plugged in my Garmin heart rate monitor to download my bike ride, my PC just rebooted (that has happened more than once with this monitor). After it came back up, I expected Chrome to recover my session, but not Internet Explorer (from what I recall, IE hasn't been very good at this). And guess what? Chrome did not recover a single one of my tabs, while IE recovered the entire session correctly. That was clearly unexpected and a pleasant surprise.

In the end, I had to go through Chrome's history to recover some tabs I hadn't read yet, something I could have avoided with IE.