
Radovan Semančík's Weblog

Monday, 23 August 2010

Recent discussions are about whether push or pull is the right model for future identity management. Impractical standards are being revived. Everybody is discussing the technology, the future, the visions. There is almost no discussion about the most difficult current problem of identity management: data.

There is (at least) one critical problem with the implementation of single sign-on, identity federation, provisioning to the cloud and other fancy buzzwords. The problem is the user database. It is not that difficult to deliver the information that someone should have access somewhere using whatever push, pull, standard or proprietary method - as long as you have that information. The reality is that enterprises do not have that information in a usable form. It is almost always distributed across several data stores, usually provided in incompatible formats, it is often inconsistent and sometimes even contradictory. And it is far from complete. Many provisioning cases are solved by non-algorithmic methods, e.g. a manager or security officer deciding whether the request "looks valid enough" to be approved. The current situation is best described as semi-organized chaos.

How could anyone build an automated, Internet-scale, cloud-enabled and standards-based identity management mechanism on top of that? Hardly. Such a project will most likely fail. But it will waste a lot of time and money before it fails.

The first step is to consolidate the data. Build a consistent user database, align policies, design business processes that can support 80% of cases with 20% of effort. It is naïve to expect that everything could be automated, therefore prepare for a reasonable amount of exceptions and human interactions from the very beginning. Single sign-on, identity federation, support for the cloud (whether push or pull) and even the standards will not provide any considerable help in that. It is mostly manual work of security staff, business people and engineers that is needed.

What can help is a well-designed and well-deployed provisioning system. Contrary to popular belief, a provisioning system is not really about provisioning. Yes, provisioning is an important part of the system, but other aspects are in fact much more important. A provisioning system can take data from several sources, convert them to a common format and merge them. Therefore it can create a unified database. A provisioning system can compare data among several systems and correlate them, thereby detecting inconsistencies. A provisioning system supports workflow and human interaction to clean up the data and supplement missing information, both during initial migration and (most importantly) during day-to-day operation.
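The convert, merge and compare steps can be pictured with a small sketch. Everything in it is an assumption made for illustration: the two sources ("hr" and "ldap"), their field names, the sample records and the choice of the employee number as the correlation key. A real provisioning system does far more, but the shape of the work is similar.

```python
from collections import defaultdict

# Hypothetical exports from two sources; names and values are made up.
HR_RECORDS = [
    {"emp_no": "1001", "full_name": "Jane Smith", "mail": "jane.smith@example.com"},
    {"emp_no": "1002", "full_name": "Bob Jones", "mail": "bob.jones@example.com"},
]
LDAP_ENTRIES = [
    {"employeeNumber": "1001", "cn": "Jane Smith", "mail": "jsmith@example.com"},
    {"employeeNumber": "1003", "cn": "Eve Adams", "mail": "eve.adams@example.com"},
]


def normalize_hr(rec):
    """Convert an HR record to the common format."""
    return {"id": rec["emp_no"], "name": rec["full_name"], "email": rec["mail"].lower()}


def normalize_ldap(entry):
    """Convert an LDAP entry to the common format."""
    return {"id": entry["employeeNumber"], "name": entry["cn"], "email": entry["mail"].lower()}


def consolidate(sources):
    """Merge normalized records by id and report inconsistencies.

    sources maps a source name to a list of normalized records.
    Returns (merged, issues); issues are left for a human to resolve.
    """
    by_id = defaultdict(dict)
    for source_name, records in sources.items():
        for rec in records:
            by_id[rec["id"]][source_name] = rec

    merged, issues = {}, []
    for user_id, per_source in sorted(by_id.items()):
        # A user known to one source but not to another is an inconsistency.
        for missing in sorted(set(sources) - set(per_source)):
            issues.append((user_id, "missing", missing))
        unified = {"id": user_id}
        for field in ("name", "email"):
            values = {rec[field] for rec in per_source.values()}
            if len(values) > 1:
                # Contradictory data: do not guess, flag it for review.
                issues.append((user_id, "conflict", field))
                unified[field] = None
            else:
                unified[field] = next(iter(values))
        merged[user_id] = unified
    return merged, issues


sources = {
    "hr": [normalize_hr(r) for r in HR_RECORDS],
    "ldap": [normalize_ldap(e) for e in LDAP_ENTRIES],
}
merged, issues = consolidate(sources)
for issue in issues:
    print(issue)
```

The conflicting e-mail address is deliberately not resolved by the code. That is the part that goes to workflow and human interaction.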

A reasonable identity consolidation project including a decent provisioning system is a necessary prerequisite for any other identity-related activity. It is a shame that engineers forget the Garbage In, Garbage Out phrase that was popular a few decades ago. If the data are bad, any system built on top of such data can only be worse.

Posted by rsemancik at 11:09 PM in Identity
Monday, 16 August 2010

Only a few small pieces remained of Sun Identity Manager (IDM) after Oracle swallowed Sun. One of the biggest pieces is the Identity Connectors Framework. I guess it was an attempt to replace Sun IDM adapters - a part of the system that had been asking for re-engineering since the times of Waveset back in 2003. Yet, Identity Connectors are not really an ideal replacement.

The goal of an Identity Connector is to connect to a remote system. It should provide a list of accounts, groups, roles or any other objects that are interesting from an identity management point of view. It should also be able to manage such objects - to create, modify and delete them. The Identity Connector Framework provides the typical API/SPI abstraction to plug in connectors for various systems. It even seems to isolate the individual connectors in separate classloaders. It allows a connector to run on a remote host. So far everything is pretty much OK.

It is naïve to expect that all target systems will use the same structure for accounts, groups and other objects. Therefore the connector provides a schema that describes how such objects are structured by the resource. That is a major improvement over Sun adapters and it should also be a major competitive advantage of Identity Connectors. If only it had been designed a bit more carefully. First of all, it looks like the intent was for the schema to use the native naming of object classes and attributes from the target system. E.g. the complete LDAP schema with native object class names and attributes is visible using the connector. On the other hand, the framework provides pre-defined and fixed object class names __ACCOUNT__ and __GROUP__ and even predefined attribute names such as __NAME__ and a few others. I cannot understand this dichotomy: on one hand exposing the full native schema, on the other hiding it behind a cumbersome abstraction. This duality complicates the design of any system that is trying to use the connectors correctly. E.g. both the __ACCOUNT__ and inetOrgPerson object classes are exposed by the LDAP identity connector and they are the same. Which one should be used? The root cause of this problem seems to go a bit deeper: the connector does not have clearly defined responsibilities. Should the connector only provide access to the objects on the target resource (i.e. be a simple protocol adapter)? If yes, then there is no need for the special __ACCOUNT__ and __GROUP__ object classes. Or should the connector also map between the target system schema and the provisioning system schema? For that it does not have sufficient features (see below). It looks like the connector lands somewhere in between, trying - and failing - to address both problems.

Yet another feature that I would expect from a "next generation" identity connector is handling of object references or membership. It means that the provisioning system should be able to discover that there are two grouping mechanisms for LDAP and one (proprietary) mechanism for roles. That would allow the provisioning system to provide much better GUI and business logic support out of the box. Yet, identity connectors do not provide such a feature. The unix group attributes are of type string. There is no information in the schema that would suggest that a __NAME__ (or Uid?) of an object from the __GROUP__ object class belongs there. There is not even a way to know that there is more than one grouping mechanism (which is pretty common in LDAP and in many custom systems). How can a provisioning system handle that? Either it cannot and leaves that to business logic (which makes deployment hard) or it can build an adaptation layer on top of the connector. Abstraction on top of abstraction. Insane.
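One way to picture the missing feature is a schema in which reference attributes carry an annotation. The sketch below is purely hypothetical: the classes AttributeDef and ObjectClassDef, the kind and references annotations and the function membership_mechanisms are invented for this illustration and are not part of the Identity Connectors Framework. Only the LDAP names (inetOrgPerson, groupOfNames with member, posixGroup with memberUid) are real schema elements.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class AttributeDef:
    name: str
    type: str = "string"
    # Hypothetical annotation: the object class this attribute points to.
    references: Optional[str] = None
    # Hypothetical annotation: which identifier of the target is stored.
    referenced_by: Optional[str] = None


@dataclass(frozen=True)
class ObjectClassDef:
    name: str  # native name, as used by the target system
    # Hypothetical annotation instead of a renamed __ACCOUNT__/__GROUP__ class.
    kind: Optional[str] = None
    attributes: Tuple[AttributeDef, ...] = ()


def membership_mechanisms(schema):
    """List (container class, attribute, referenced class) for every
    attribute that is annotated as a reference."""
    found = []
    for object_class in schema:
        for attr in object_class.attributes:
            if attr.references is not None:
                found.append((object_class.name, attr.name, attr.references))
    return found


LDAP_SCHEMA = [
    ObjectClassDef("inetOrgPerson", kind="account",
                   attributes=(AttributeDef("uid"), AttributeDef("cn"))),
    ObjectClassDef("groupOfNames", kind="group",
                   attributes=(AttributeDef("cn"),
                               AttributeDef("member", references="inetOrgPerson",
                                            referenced_by="dn"))),
    ObjectClassDef("posixGroup", kind="group",
                   attributes=(AttributeDef("cn"),
                               AttributeDef("memberUid", references="inetOrgPerson",
                                            referenced_by="uid"))),
]

for mechanism in membership_mechanisms(LDAP_SCHEMA):
    print(mechanism)
```

With such annotations the native names stay visible, and a provisioning system could still find out that there are two grouping mechanisms and what kind of value each of them stores.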

Overall it looks like the folks at Sun have chosen the wrong way. Identity Connectors are not as much of an improvement over the older adapters as I would expect. It looks like the responsibilities should have been defined much better. And I would suggest that schema annotation is much preferable to schema transformation in this case. The Identity Connectors Framework needs a radical improvement even before it has gained any serious acceptance.

The punch line is that Oracle is planning to use this very mis-designed framework to improve their Oracle Identity Manager.

Posted by rsemancik at 12:22 PM in Identity
Monday, 12 July 2010

Architecture and design of software systems is quite an adventure. There are very few hard constraints in software and even fewer in software architecture. Almost anything can be designed. And the vast majority of the designs will look good and feasible, even if quite an intensive review process is applied. It is extremely difficult to find the mistakes in a software architecture just by talking about it. As a consequence I dare to speculate that all non-trivial software architectures contain at least one error.

Software architecture needs to be put into conflict with reality as soon as possible. Only reality can uncover the problems. The architecture needs to be quickly applied to design. Key concepts should be designed down to the details early in the project. The design needs to rapidly lead to the implementation of prototypes. Prototypes need to be tested immediately. Problems need to be addressed as soon as possible. Solutions to the problems of prototypes will feed back into the design. Changes in the design will influence the architecture. Changes in the architecture will need new prototypes ... and we have a loop here. This loop had better be convergent and finite. All architects need this kind of loop for the architecture to be of any practical use. The difference between a good and a bad architect is the speed of convergence. Bad architects will need many iterations and most of them will happen during the project implementation phase. Changing the architecture during implementation is really expensive. Good architects will settle the architecture in a small number of iterations and will have pretty stable basic concepts before the full-scale implementation starts. A few adjustments to the architecture during implementation are always necessary, but these should not fundamentally change the basic idea. Such projects can usually be delivered at reasonable cost.

Architecture that is not validated by implementing parts of it is just a theoretical exercise. It may be a good first step, but it definitely cannot be presented as a final, practical result. Untested architecture may be good for experiments and research, but it is almost worthless from an engineering point of view.

This principle applies to standardization even more strongly than to software architecture. Standards influence a lot of engineers. Standards can make entire families of technologies either succeed or fail. Good standards are based on working software. Only working software can provide assurance that a standard does not have any major flaws. IETF standards are based on working software. That's the approach that contributed to the success of the Internet as a whole. But too many of the standardization bodies do not follow this practice. Some of us can well remember the infamous example of CORBA, but it looks like most people have already forgotten. The WS-* stack seems to be heading in the same direction. And there is one particular example that I would like to mention: Service Provisioning Markup Language (SPML). SPML defines a (web) service specified using XML schema (XSD). However, the XML schema for the current version of the SPML standard does not even pass validation. It violates the Unique Particle Attribution (UPA) rule. Therefore the standard SPML schema is unusable for many implementations. E.g. Java JAXB cannot process it and therefore it cannot be a JAX-WS service. I have seen that people who use it modify the schema to make it usable - but then, what's the point of a "standard" there?

There is very little space for innovation in the standardization process. Almost none. Innovation should happen in engineering and experimental projects and only the working results of such projects should be standardized. However, design by committee is a well known and widely used anti-pattern. Avoid using standards that are not based on working software. And especially avoid creating such standards.

Posted by rsemancik at 9:49 PM in Software
Tuesday, 29 June 2010

Management summary: The Sun is setting. Some of its fortune can hopefully be saved, some of it will probably be lost. The attempt to replace a practical mess with an unusable dead horse seems to be futile. But shamans still favor it.

The technology company Sun Microsystems is being swallowed by Oracle Acquisitions. Sun had many interesting products in its portfolio. Being a technology company, Sun has failed to capitalize on most of them. I've heard that Sun will never miss an opportunity to miss an opportunity. That describes Sun's business strategy quite well. Nevertheless, Sun had some relatively good technologies. Software technologies. Even some successful products. Now it seems to be quite clear that most of them will be killed under the rule of Oracle Acquisitions. One company is bold enough to try to save some of Sun's portfolio. Fortunately Sun has made some of its products open source, so there is some chance they can be saved. This can be an interesting precedent. Can an open source strategy give customers better continuity than over-priced commercial support agreements? I wonder ...

Our company was a Sun partner and most of our team has worked with Sun technologies since the dawn of the Internet in our post-communist country. For the last 6 years we have specialized in identity management. Sun Identity Manager (aka Waveset Lighthouse) was quite a successful product. It had a surprisingly high market share. However it was quite a mess. It was a really wild mix of ingenious ideas, bad decisions, layers of duct tape and engineering dead ends. But it was understandable, extremely flexible and had the right set of development tools. And it worked. Mostly. The decision of Oracle Acquisitions to kill Sun's products was quite a hit. It hit us as badly as it hit our customers. Oracle is proposing its own product as an alternative. But IDM solutions are usually heavily customized. Anyone claiming that migration to a competing technology will be easy is speaking from ignorance at the very least. The migration means reimplementation. All the effort and all the costs "invested" again. Our customers were left in the void by the decision of Oracle, therefore we were "motivated" to look around for alternative solutions.

The natural choice was Oracle Identity Manager (OIM), the thing that Oracle acquired with Thor a few years ago. However, the product is unusable. That's how it is, plain and simple. The product itself may not be bad, but it is extremely difficult to customize and use. In fact it may be easier to implement an IDM solution from the ground up than to customize OIM. My personal estimate is that an IDM project with OIM will require at least three times more effort than a similar project with Sun IDM. OIM has really minimalistic tools, terrible debugging and diagnostics mechanisms, no build system and minimal deployment support infrastructure. An engineering nightmare. As I was staring at the monitor in disbelief and trying hard to figure out how that Thor beast works, an old Dakota tribe saying came to my mind: When you discover you are riding a dead horse, the best strategy is to dismount. Therefore I've dismounted.

Now I can understand why OIM has almost no deployments in our region. However, I do not understand why analysts put OIM in their magic quadrants. OIM surely provides lots of cool features in the marketing slideware presentations. But once it gets to real installations, it takes so much effort to implement simple things that the effort to implement something moderately complex will be counted in man-decades, if not man-centuries. Is that the "ability to execute"? A friend speculated that analysts never install. And after the experience with OIM I'm quite inclined to believe that. One way or another, my confidence in the analysts' product assessments suffered considerably.

The experience with OIM, however bad it was, provided a valuable lesson: usability, usability, usability. The product may be based on unique ideas and innovative technology, it may be a work of real genius - however all of that is in vain if the product cannot be understood and efficiently used.

Posted by rsemancik at 11:36 PM in Identity
Friday, 7 May 2010

From time to time I win the privilege to develop some code. I usually try to do that using the same tools and environment that the development team is using. That way I can feel all the troubles that the developers go through and maybe I figure out how to make the work more efficient. Therefore I have experienced a few development environments during my career, all the way from vi to Eclipse. And that's exactly the damned thing I want to rant about today. No, it is not vi. Eclipse, that's the one to blame. I've used Eclipse several times before. It worked, but it actually never worked perfectly for me. There were some glitches all the time. But a few days ago it culminated. To make a long story short: I wasted many hours trying to make Eclipse Galileo work on Ubuntu with the Subclipse and Maven2 plugins. I have failed. It just does not work. But the time was not entirely wasted, as now I can make a really bad example out of Eclipse.

Eclipse is a multi-platform system. There are flavors for Windows, Mac and Linux. Yet Eclipse Galileo SR1 somehow did not really work on my Ubuntu Linux. I could click wherever I wanted, but sometimes it just did nothing. Maybe that is a hidden usability feature that makes programmers think and not just blindly click? Or maybe it is there to train them to use the keyboard instead of the mouse? Anyway, it makes this specific version really useless. Moral: If you claim you have a multi-platform system, take the time to really test it on all supported platforms.

The way Eclipse is composed of plugins makes it very flexible. Just pick and choose plugins as you wish. But it also makes the system very complex. There are uncountable*) combinations of Eclipse core and plugins, too many ways they can influence each other and too many things to go wrong. And the result is that something goes wrong most of the time. The good news is that the glitch is usually negligible on mainstream platforms, but users of less popular platforms usually suffer. Moral: Don't make your system too flexible. You will not be able to test it and to maintain it. If you go beyond a reasonable amount of flexibility, user satisfaction will go down instead of up.

Eclipse Galileo (that is the most recent version) for Java Enterprise Edition does not come with support for Subversion and Maven. If you were to start a new Java project today, what would you choose to build it? It won't be make and probably not ant. I could understand that Maven support was not part of the core distribution 5 years ago. But now it is a must for a development environment. The same applies to Subversion support. It is probably the most popular version control system ever, yet Eclipse does not come with support for it. Yes, you can install them as plugins ... but ... see above. Moral: If you must have a flexible system, make sure that it comes with a reasonable initial configuration. Flexibility is a difficult concept and problems caused by it are extremely hard to diagnose. Make sure that your users will not experience problems caused by flexibility before they know your system well and can deal with the problems.

Now let's have a look at competition. Get recent NetBeans, download it, install it and - surprisingly - start developing. There's not only support for Subversion and Maven out of the box, there is also a very reasonable set of Java EE wizards and plugins. You can have a skeleton of pretty complex Java EE project completed literally in minutes. Productivity and usability, that should be the primary focus of development environment. It is not engineering to make something work. That's science. Engineering is to make something work better and especially at least a dollar cheaper than the next best competing company. Productivity is essential.

*) Figuratively speaking. They are in fact countable and even finite - as one of my favorite professors commented when I used that phrase during my dissertation defense.


Posted by rsemancik at 10:25 AM in Software
Tuesday, 6 April 2010

GET, PUT, DELETE and POST are four basic operations of HTTP. Many people think these are operations of REST, however Roy Fielding does not mention them in his dissertation where the REST architectural style is defined. The only constraint of REST regarding operations is the constraint of uniform interface (which by itself is problematic). However, these four operations are the de facto uniform interface of the Web. Having an unsatisfactory definition, this interface is frequently misused. And no wonder ...

Today I want to talk about the PUT operation. The current definition of PUT makes it almost unusable. PUT is used for resource creation and modification. Both use cases are problematic.

The problem with resource creation is that the client invoking the PUT operation must specify the resource URL. But how could the client know the correct URL for a new resource? The server should maintain the URL namespace, not the client. The URLs should be cool, should not be reused and should follow application conventions. Believing that the client will keep the URL namespace consistent is like believing in any of The Eight Fallacies of Distributed Computing. The clients are out of control. They can be buggy, written by ignorant coders or just outright malicious. The server should maintain order in the URL space. But the PUT operation does not allow that. It places the responsibility for URL creation on the client. Oh yes, the server can check the client's request and enforce the correct URL form. But such an approach has many drawbacks. Firstly, it may be much more difficult to check a URL than to assign it. Secondly, the URL assignment logic needs to be in many places (all the clients and the server). Thirdly, the client must be able to detect a URL assignment problem and retry, which complicates the client. And it may not be able to succeed at all if the client's and server's URL assignment policies are incompatible. And lastly, the URL assignment may take many round-trips. This makes the PUT operation almost useless for resource creation. How much easier would it be to make the server responsible for URL assignment?
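Server-side URL assignment can be sketched with a tiny in-memory model. The class ResourceStore, the /users collection and the choice of status codes are assumptions made for this illustration, not a description of any particular framework. Creation goes through a POST to the collection, the server picks the URL, and a PUT to a URL the server never issued is refused.

```python
import itertools


class ResourceStore:
    """In-memory stand-in for a server that owns its URL namespace."""

    def __init__(self, collection="/users"):
        self.collection = collection
        self._ids = itertools.count(1)  # never rewinds, so URLs are not reused
        self._resources = {}

    def post(self, representation):
        """Create a resource; the server assigns the URL."""
        url = f"{self.collection}/{next(self._ids)}"
        self._resources[url] = representation
        return 201, url  # 201 Created, URL as it would appear in Location

    def put(self, url, representation):
        """Replace an existing resource; client-invented URLs are refused."""
        if url not in self._resources:
            return 404, None  # illustrative choice of status code
        self._resources[url] = representation
        return 200, url

    def delete(self, url):
        if url not in self._resources:
            return 404
        del self._resources[url]
        return 204


store = ResourceStore()
status, jane_url = store.post({"name": "Jane"})
print(status, jane_url)                            # server-chosen URL
print(store.put("/users/jane", {"name": "Jane"}))  # client-chosen URL is refused
print(store.delete(jane_url))
print(store.post({"name": "Bob"}))                 # the deleted URL is not reused
```

The URL policy lives in exactly one place, and the client needs a single round-trip to learn the URL of the new resource.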

If PUT is used to update a resource, it is assumed that the body of the PUT request is the new absolute state of the resource. There is no locking, neither optimistic nor pessimistic. There are no mechanisms that would enable consistency. Therefore you just cannot have consistency with PUT. Don't we need consistency in an open world-wide distributed information space? In fact we do need it, especially now as the Web becomes "writable". Yes, the server could still check the acceptability of the new resource state, e.g. by making the resource version part of the resource state and checking it on each PUT (let's pretend for a while that mixing data and meta-data is a good idea). But if we accept such an approach, what is the difference between PUT and POST? An interface is good if there is nothing to remove, if there is no redundancy. If PUT and POST are the same, one of them should go.
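The server-side check mentioned above, with the version carried inside the resource state, might look like the following sketch. The names VersionedStore and VersionConflict and the layout of the state (a dictionary with "version" and "data") are hypothetical and chosen only for this example.

```python
class VersionConflict(Exception):
    """Raised when a PUT is based on a stale version of the resource."""


class VersionedStore:
    """In-memory store where the version is part of the resource state."""

    def __init__(self):
        self._state = {}

    def create(self, url, data):
        self._state[url] = {"version": 1, "data": data}

    def get(self, url):
        # Return a copy so that clients cannot modify the stored state directly.
        return dict(self._state[url])

    def put(self, url, new_state):
        current = self._state[url]
        if new_state["version"] != current["version"]:
            raise VersionConflict(
                f"expected version {current['version']}, got {new_state['version']}"
            )
        self._state[url] = {"version": current["version"] + 1, "data": new_state["data"]}
        return self._state[url]["version"]


store = VersionedStore()
store.create("/users/1", "jane@example.com")

# Two clients read the same state and then both try to write.
first = store.get("/users/1")
second = store.get("/users/1")

first["data"] = "jane.smith@example.com"
print(store.put("/users/1", first))  # accepted, version is incremented

second["data"] = "j.smith@example.com"
try:
    store.put("/users/1", second)
except VersionConflict as conflict:
    print("rejected:", conflict)     # the stale write is refused
```

Note that the server has to interpret the body of the request to make this work, so the PUT is no longer a plain replacement of state. That is the redundancy with POST that the paragraph above points out.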

The PUT operation should either change or die. I strongly recommend not using it in its present form.


Posted by rsemancik at 11:52 AM in Web architecture
Tuesday, 23 February 2010

Simplicity is an important architectural feature. Simple systems are easier to understand. A simple component has fewer moving parts and therefore fewer reasons to break. Simple systems are easier to change and maintain. Yet our world is a complex thing and we, naive human beings, are often making it even worse. We are creating legislation that no common citizen can understand unless he spends at least 5 years of his life attending a law school. We are creating all kinds of rules and regulations ranging from informal social norms all the way to multi-national treaties.

Software systems are here to help us deal with the complexity. But that necessarily makes them complex. It is not the programming language or operating system that makes software complex. Novice programmers can handle technological problems quite quickly (ever seen "Teach yourself Java in 21 days"?). The difficult problem is not "how to build it". The real problem is "what to build". The environment, user expectations, past and future - these are the primary sources of complexity.

What happens if you try to solve a complex problem with a simple solution? It may work - if you are a genius and you figure out something that generations before you missed. But honestly, most of us are not geniuses. The more likely outcome is that a simple solution for a complex problem will not work. It will break in spectacular and unexpected ways. Why is that?

You start to analyse the problem, find the important pieces, the "essence". Once you have that you design it, implement it, test it, reimplement it, test it, deploy it ... and you discover that the "essence" was not enough to solve the problem. I mean the real problem of human users, not the usual problem of "how to pass acceptance testing and get the money". Therefore you rethink the problem and discover that the essence is much more complex than you expected. Most projects now just start to complicate the thing that they have already implemented. The system gets more and more difficult to develop and no amount of refactoring seems to be enough. The system looks more and more like some devious creation of Dr. Frankenstein. An abomination. Such a project will run far over the budget and miss all the deadlines. And the project goals change as well. The original goal of solving the real problem of system users is quickly transformed to "just finish it, get the money and get out of here".

The initial definition of the problem and the scope of the project are essential. The specification needs to have the correct breadth. Specification of a problem that is too complex will make the project extremely expensive just to find out that users are using only 20% of the system functionality. Specification of a problem that is too simple will lead to a system that fails to satisfy the users and is unusable in practice. The specification also needs to have the correct depth. A waterfallist-like 1000 pages of analytical documentation is pure nonsense. Nobody will read it and it creates an analysis-paralysis situation. An agilist-like one-liner "the system must work" will not do either. It will send the project into an endless loop of refactoring, competing project goals and divergence.

Small problems, inaccuracies and omissions in the initial project specification are easy to fix during the project. But if the breadth and depth of the specification are wrong, the project is doomed to failure from the very beginning. If the goals are wrong, is it a success if such goals are reached?

Posted by rsemancik at 9:16 AM in Software
Friday, 29 January 2010

Quite an interesting scam appeared on Facebook. It was just a matter of time before something like that would pop up, yet I was quite surprised when I actually saw it. The scam works like this: There is a simple HTML page that promises to provide nude photos in a zip file if you click on the button. However, if you click on the button you will see no butts and tits. A link to the tricky page will be posted to your facebook profile instead. If you want to try it go to http://homeslices.org/f2.html (if the page is still around). But you have been warned.

The trick is simple. The page creates an iframe containing a pretty standard facebook form to share a link. However the frame is almost invisible, therefore you cannot see it. But the browser still thinks you can see it and is processing it. The tricky page has a "View" button in the same location as the "Share" button on the invisible facebook page. You think you are clicking on the "View" button but instead you are clicking on the "Share" button on facebook. The iframe is fetched by your browser, therefore it is your identity that is used on facebook to post the link.

This page is pretty innocent. All it does is a bit of humiliation for the victims, amusement for experts and undoubtedly a lot of fun for the author. But imagine that this very same method is used to subvert your Internet banking. I guess that the method could be adapted to subvert many of current Internet banking applications. It won't be that funny any more.

This is the price we pay for flexible presentation formats. There are two basic principles of the trick:

  1. Mix the content from two sites in one window. Content from facebook is displayed in a page where you do not expect it, in a wrong context, with a wrong URL in the URL bar.
  2. Create an ambiguous display of information. The browser thinks you can see the "Share" button. It has 1% opacity, therefore it is still somehow opaque and ergo visible. Therefore the browser thinks that if you click on the place where the "Share" button is, you want to submit information to facebook. But in fact you do not see the "Share" button because it has only 1% opacity and therefore is almost invisible. You are clicking on that area because you see the "View" button that is behind it.

The first problem is a specific problem of HTML. It could be fixed quite easily, if there were enough "political will" to do it. But the second problem is the real problem. How opaque should something be to be considered opaque enough? Should 1% grey text on a white background be considered visible? Or can a 2pt font be considered readable?

Probably the most serious implication of this problem is somewhat independent of the Web. Presentation formats are very dangerous when used in a legally binding way, for example if you sign a document with a digital signature. If you sign a contract and it contains a paragraph written in light grey text on a white background, should such text be considered part of the contract or not? Some devices may display that text as perfectly readable while on some devices it cannot be seen. This opens up a huge door to scams of all sizes.

This problem applies universally to any data format that includes rich presentation features: HTML, Microsoft Word documents, RTF, OpenDocument and many more. But maybe the worst aspect of all of this is that our government as well as many other governments in Europe explicitly allows such data formats for legally binding documents signed by "guaranteed digital signature". I'm really lucky that I have no qualified certificate to create such a signature.


Posted by rsemancik at 7:05 PM in security
Tuesday, 26 January 2010

SOAP is one of the prominent protocols for remote procedure invocation (RPC). It can do more than that, but it is used almost exclusively for RPC. More specifically it is used for RPC across the Web, both internally and externally. It is used on the Web so frequently that most people working with SOAP do not even realize that it can be used without HTTP and in non-RPC way.

SOAP by itself is quite a simple XML-based message format. However it is accompanied by an army of profiles, recommendations and especially a set of so-called WS-* specifications. That creates a "SOAP stack" that is quite complex. This labyrinth of specifications is an attempt to address the qualities of SOAP such as security, reliability, addressing, distribution of policies, etc. It makes SOAP quite a flexible mechanism. But ...

Flexibility does not come for free. Until very recently the price was paid by suffering all kinds of interoperability problems. It was so severe that a special organization was established to improve interoperability. Now basic SOAP implementations work together acceptably well, but the situation is not that good for various WS-* extensions. It will take a lot of effort to make implementations fully interoperable. But that is not a problem of SOAP itself. It is an inherent cost of complexity and distribution.

SOAP is not the first "fabric" of distributed systems. There was CORBA before SOAP and Sun-RPC before CORBA, to name just a few of many existing mechanisms. However, the designers of SOAP failed to learn from the past. The intent of SOAP was to simplify things. But the SOAP stack is now almost as complex as CORBA was ten years ago. SOAP is XML-based and with HTTP it can pass easily through firewalls (that are broken anyway). But that's almost the only advantage over CORBA. And now about the drawbacks ...

The most serious failure of SOAP design is the lack of support for object orientation. SOAP is not about invoking methods on objects, it is about invoking operations of static services. Objects cannot be arguments in SOAP messages, cannot be returned from operations and there is no support for object references. All of that was a fundamental part of CORBA, yet there is no concept of objects in SOAP. In fact it is an odd joke to call it Simple Object Access Protocol - as it is definitely not about objects and either not simple or not a protocol (depends on your point of view).

SOAP is also not outright compatible with the World Wide Web architecture. The Web is based on the REST style, which defines a few basic operations that should be common to all services. SOAP services can use arbitrary operations without any link to the operations of REST. The REST architecture also naturally assumes object orientation - web resources are (almost) objects. SOAP does not deal with objects at all. Therefore the applicability of SOAP at Internet scale is quite a controversial topic.

SOAP is good in the enterprise and in quite closed environments where interoperability can be assured by testing. SOAP with WSDL has quite a strong interface definition mechanism. That is a rare trait for a technology born on the Internet and it is a necessary condition for composing complex systems. SOAP is also almost the only option for integration, as CORBA is dead and asynchronous mechanisms are seen as too complex and unnecessary by buzzword-driven integrators. If we are lucky enough, SOAP may eventually get to the state where CORBA was a decade ago ... I can almost hear the melody ... just little bits of history repeating.

Posted by rsemancik at 9:34 AM in Software
Monday, 25 January 2010

RESTful web services are seen by many (especially young) developers with almost religious awe. Such services are built using the standard HTTP protocol with the usual HTTP methods as operations. RESTful web services have no arguments; they GET, PUT, POST and DELETE resource representations. The resources are identified by URLs, which are also used for links among resources. Such an approach requires a fundamental change of mindset when compared to the more traditional RPC style of building services. But that is not really a problem: most simple services can be modeled acceptably well using the RESTful approach. The problem is not in the functional aspect.
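
The change of mindset can be sketched in a few lines. The following is a toy in-memory model, not a real HTTP server: the class name, the URL layout and the status codes it returns are illustrative assumptions. The point is that the set of operations is fixed and everything else is expressed through URLs and representations.

```python
class ResourceStore:
    """Hypothetical in-memory 'server' that maps URLs to representations."""

    def __init__(self):
        self.resources = {}
        self._next_id = 1

    def get(self, url):
        # GET returns the current representation and changes nothing.
        if url in self.resources:
            return 200, self.resources[url]
        return 404, None

    def put(self, url, representation):
        # PUT creates or replaces the resource at a URL chosen by the client.
        status = 200 if url in self.resources else 201
        self.resources[url] = representation
        return status, url

    def post(self, collection_url, representation):
        # POST asks the server to create a new resource and choose its URL.
        url = f"{collection_url.rstrip('/')}/{self._next_id}"
        self._next_id += 1
        self.resources[url] = representation
        return 201, url

    def delete(self, url):
        # DELETE removes the resource; later GETs report 404.
        if url not in self.resources:
            return 404, None
        del self.resources[url]
        return 204, None


store = ResourceStore()
status, order_url = store.post("/orders", {"item": "book"})
print(status, order_url, store.get(order_url))
```

An RPC-style service would instead expose operations such as a hypothetical `createOrder` or `cancelOrder`, each with its own arguments.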

The problem is, as usual, in the tricky non-functional aspects. Web services are a mechanism for communication between computers, but the Web was designed for human-to-computer interactions. Many issues appear out of the blue if the Web is used for something that it was not really designed for. Let's have a look at the security aspect of RESTful web services as an example.

It is difficult to authenticate the invoker of the service to the provider. There are two authentication mechanisms for HTTP (basic and digest), but these are designed for interactive human-to-computer authentication. HTTPS in mutual authentication mode provides another solution. This can be non-interactive, but it is quite hardcoded to X.509. Under normal circumstances it can authenticate two sites to each other. What a service would need is to authenticate a user to the site. If you want to authenticate a user on the client side to the server, you can still do that with a somewhat atypical use of X.509. In that case each client site must be a certificate authority. However, as certificate constraints are not well supported, root certificate authorities are not likely to issue certificates that allow clients to create subordinate certificate authorities.
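
To see why the basic mechanism fits a human typing a password better than a service calling another service, here is a small sketch of how the Authorization header for HTTP basic authentication is formed. The function names are made up for this example; the encoding itself (Base64 of `username:password`) is what the HTTP specification defines.

```python
import base64


def basic_auth_header(username: str, password: str) -> str:
    """Return the value of the Authorization header for HTTP basic auth."""
    raw = f"{username}:{password}".encode("utf-8")
    return "Basic " + base64.b64encode(raw).decode("ascii")


def decode_basic_auth(header: str):
    """Recover (username, password) from a basic Authorization header."""
    scheme, _, token = header.partition(" ")
    if scheme != "Basic":
        raise ValueError("not a basic authentication header")
    username, _, password = base64.b64decode(token).decode("utf-8").partition(":")
    return username, password


header = basic_auth_header("Aladdin", "open sesame")
print(header)
print(decode_basic_auth(header))
```

The encoding is trivially reversible, so the provider (and anyone who can read the request) learns the actual password. There is no place in it for a statement such as "site A vouches that this request is made on behalf of user B".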

But even if HTTPS/SSL/X.509 can be fixed, it will most likely not solve the problem. I doubt that X.509 can be flexible enough to support the broad variety of security schemes that an Internet-wide technology requires. And the flexibility comes with a cost: interoperability. People working with enterprise PKI know how difficult it is to achieve interoperability of different X.509 implementations, and that is miles away from Internet scale. There has been only a slight improvement in two decades of X.509 existence, therefore there is little hope that X.509 will be the right solution for the Internet.

There are (relatively) new security mechanisms out there, but these apply more to RPC-style web services. WS-Security and SAML are good examples. WS-Security specifies a header for the SOAP request that contains security credentials. SAML specifies protocols and security tokens applicable to various scenarios, including Internet-scale single sign-on and federation. However, it is difficult to use SAML with RESTful web services. SAML tokens are usually many lines of XML code. In SOAP there is a place for the token in the message header, but there is no such place in HTTP. I don't think that placing a few kilobytes of XML data in a custom HTTP request header is an ideal solution. If that worked at all, it would be a non-standard hack. And there is no other place in an HTTP GET request for such data. There is a way to shorten a SAML token into a few bytes of SAML artifact. But artifact resolution requires an additional round trip. In fact several round trips, as a new TCP connection (and most likely also an SSL handshake) is usually required. It also requires an active client that is able to listen for connections, and maintenance of state on the client side. There is also the question of how to pass the artifact to the server. The usual way of putting it in the query string is a violation of REST principles, therefore the result will likely be a non-standard solution or a broken architecture.

The situation is quite similar for many other non-functional aspects. It is difficult to guarantee consistency, atomicity and coordination of RESTful web services (e.g. to make them part of a transaction). As URLs are both service endpoints and object identifiers, it is difficult to move a service without breaking compatibility. There is no practical interface definition language and there are no interoperability guidelines. Each definition of a RESTful service is free-form text for humans to read and implement, with very limited possibility for code generation ...

I'm not trying to say that everything RESTful is useless. Both REST and RESTful web services can be very useful, especially for services that shoot for Internet scale. RESTful web services undoubtedly have many advantages but also many limitations. Standard RESTful web services are not yet ready for anything but very simple public services - for those a RESTful solution could be ideal. However, the RESTful approach fails if service quality is important. Custom non-standard solutions can help a bit, but these have their own dangers, especially if the goal is to create interoperable Internet-scale services.

Engineering is not religion and technologies should be assessed with a sceptical eye. An engineer who designs anything RESTful should be well aware of the limitations of REST and the Web instead of blindly following the hype.

Posted by rsemancik at 12:43 AM in Software
Monday, 18 January 2010

The world is not an objective place. There seems to be no single point of view, no absolute truth. There is only a little piece of information that could be regarded as reliable - information that is well summarized by the famous cogito, ergo sum. All the rest is, more or less, speculation.

Consider some quite distant land, for example Antarctica. Have you been there? Have you observed it personally? Most probably you haven't. All you know about Antarctica is second-hand information. They say that strange birds that cannot fly live in Antarctica. Penguins, that's what they are called. Would you believe that? Yes, you probably would. Have you heard that the Yeti recently moved to Antarctica? Would you believe that? You probably wouldn't. Both "here be penguins" and "here be Yeti" are information. These are not facts, but mere information. It is belief that makes them into facts.

But even things and phenomena that you personally observe cannot be automatically regarded as true. Think of David Copperfield and the Statue of Liberty. People have witnessed how the statue disappeared. Yet, if you were one of them, would you believe that a huge steel-and-copper statue really ceased to exist for those few moments? Probably not. How many times have you seen pretty ladies sawn in half, disassembled into pieces and reconnected again, or levitating freely in the air? What we see may not be what it appears to be. I'm sure you will be amazed by this excellent performance by my favorite duo Penn and Teller.

Think about your date of birth. Do you think that the date of your birth is an unquestionable fact? Not really. You were there when you were born, but you probably do not remember it. And you were quite incapable of checking the date for yourself at that time. Therefore your date of birth is just information. It comes from quite a trusted source but, strictly speaking, it is not unquestionable.

Any information must be regarded with an appropriate level of confidence. You will probably not really doubt your date of birth, therefore the level of confidence is very high. You believe in that information. But you will probably doubt that the Yeti lives in Antarctica (everybody knows that the Yeti lives in the Himalayas). Therefore the level of confidence for that information is low. You do not believe it. However, you may slowly increase the level of confidence as more and more expeditions report encounters with the Yeti in Antarctica. As it goes beyond a certain threshold you may start believing it. And once the popular press brings convincing evidence that what was considered to be the Yeti was in fact a mutated giant rat from Mars, transported to Antarctica for the sole amusement of penguins by the four-headed hyper-intelligent lizards of Sirius IV, you may quite stop believing in the Yeti.

Seems pretty obvious, doesn't it? Now how is it related to software?

Software is all about information. However, the overwhelming majority of software systems have no ability to be "somewhat inclined to believe in the Yeti" or to "quite doubt that the Yeti has moved recently". Most software systems have only one level of confidence: fact. That was not a problem when information systems were small and disconnected. A user working with a specific system was reasonably aware of how reliable the information coming from that system was. The user either knew how the system worked or slowly learned how reliable the information was by confronting it with reality. The user, as a thinking human being, compensated for the inability of the computer system to deal with uncertainty.
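
What "more than one level of confidence" could look like in software can be sketched quite simply. The following is only an illustration under stated assumptions: the `Claim` class, the 0.9 belief threshold and the likelihood ratios are invented for this example, and the update rule is a plain Bayesian odds update rather than anything a particular system is known to use.

```python
from dataclasses import dataclass


@dataclass
class Claim:
    """A piece of information together with how much we trust it."""
    statement: str
    confidence: float  # probability-like value, strictly between 0 and 1

    def observe(self, likelihood_ratio: float) -> None:
        """Update confidence with new evidence (Bayesian odds update).

        A ratio above 1 supports the claim, a ratio below 1 undermines it.
        """
        odds = self.confidence / (1.0 - self.confidence)
        odds *= likelihood_ratio
        self.confidence = odds / (1.0 + odds)

    def believed(self, threshold: float = 0.9) -> bool:
        # The claim is treated as a "fact" only above the threshold.
        return self.confidence >= threshold


yeti = Claim("Yeti lives in Antarctica", confidence=0.01)
for _ in range(5):
    yeti.observe(4.0)  # each expedition report is moderately convincing
print(round(yeti.confidence, 3), yeti.believed())

yeti.observe(0.001)  # the giant-rat-from-Mars story undermines the claim
print(round(yeti.confidence, 3), yeti.believed())
```

A system that stores only the statement and drops the confidence value is forced to treat every claim as either a fact or nothing at all.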

But such a simple approach will fail in the case of a global, distributed, hyper-connected information super-highway such as the Web or the Semantic Web. Users don't know how the displayed information was acquired and processed, and they usually have no time to spend a few years confronting the information with reality. Users of the Web have no way to assess the reliability of the information they see. The simple binary model of true and false will not work in this environment. Any system using such a binary model that includes computer-to-computer communication on a large scale is doomed to failure.

I quite believe that the future is not really bright for Internet-scale web services and semantic web. Unless they can learn how to doubt.

Posted by rsemancik at 1:29 PM in Software
Wednesday, 13 January 2010

I was surprised to find out that not many people can create good abstractions. Many people are good at thinking about concrete objects and problems, but only a few of them can think abstractly. We in the software industry are forced to think abstractly from the very beginning, as software itself is somewhat abstract. However, when it comes to creating higher-level software abstractions, people often fail.

Interfaces are probably the most significant abstractions in software. Interfaces are formed from programming language constructs, network protocol messages, states and sequences, signals, file formats, XML tags and many other elements. Interfaces provide a basic mechanism that an architect can use to exercise control over the system. Interfaces are a powerful tool to contain change, to enhance reusability, and to make the system more understandable and manageable. Yet too many interface definitions are weak, imprecise, incomplete or outright misleading.

During the course of several years I found myself gradually compiling a list of items that need to be included in a good interface definition. Recently I have found the time to put it into a document, add some explanation and examples. The result is here:

Interface Definition, Guidelines and Recommendations

I have decided to publish it under the Creative Commons Attribution license (CC-BY), so you can freely use it in your project as long as you give me proper credit. I recommend that you copy and paste parts of the document to create guidelines suitable for your project. I hope that this helps many people to improve the skill of creating abstractions.

Posted by rsemancik at 3:57 PM in Software
Friday, 8 January 2010

The Web and especially OpenID have yet to learn an important lesson: nothing is permanent. Will Norris mentions it in his post. To make his long story short, the problem is that OpenID relies on DNS and DNS names can be reassigned. With a change of control of a DNS name, control of the associated OpenID identifier changes as well. Therefore a user may be required to pay for a domain that he does not want any longer, just to avoid losing control over the OpenID identifier. The root of the problem is that DNS is not really an identification mechanism, but rather an addressing mechanism. The OpenID design does not account for that.

The purpose of an address is to locate an object, therefore it contains information about the object's location - directly or indirectly. The address must change if the location of the object changes. DNS uses a level of indirection to reduce the number of changes needed if an object's location changes, but it does not reduce them to zero. You may be forced to pay for a domain forever if you want to make a DNS name a permanent identifier - assuming you can do that at all. For example, the rules for the sk top-level domain will force you to yield your domain in case someone registers a trademark that is the same as your existing domain name. Therefore making a DNS name persistent may be quite costly. A DNS domain is an address. Get over it.

The purpose of an identifier is to distinguish the object from other similar objects. Well-designed identifiers do not need to change. An identifier may identify an object that does not exist any longer, but it should never identify a different object. Think of ASN.1 OIDs, ISBNs or similar identifiers. For identifiers to be efficient, their assignment should be very cheap and their maintenance must be extremely cheap or entirely free.

It is not wrong per se to use an address in your system. But it is a mistake to use an address and assume that it has the properties of an identifier. It is a failure to assume that an address will not change - almost as serious a mistake as the assumption that an identifier can always be resolved.
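
The difference can be sketched in code. This is a hypothetical design, not how OpenID or DNS work: the `Registry` class and the use of random UUIDs as identifiers are assumptions made for illustration. Identifiers are assigned once and never reused, addresses may change or pass to someone else, and resolution is allowed to fail.

```python
import uuid


class Registry:
    """Hypothetical registry that keeps identifiers and addresses apart."""

    def __init__(self):
        self._addresses = {}  # identifier -> current address, or None if retired

    def register(self, address):
        # Assignment is cheap: a fresh random identifier that is never reused.
        identifier = uuid.uuid4()
        self._addresses[identifier] = address
        return identifier

    def move(self, identifier, new_address):
        # The address changes, the identifier stays the same.
        if identifier not in self._addresses:
            raise KeyError(identifier)
        self._addresses[identifier] = new_address

    def retire(self, identifier):
        # The object is gone; its identifier still never names anything else.
        self._addresses[identifier] = None

    def resolve(self, identifier):
        # Resolution may fail: None means unknown or no longer reachable.
        return self._addresses.get(identifier)


registry = Registry()
alice = registry.register("alice.example.org")
registry.move(alice, "alice.example.net")       # Alice gives up the old domain
squatter = registry.register("alice.example.org")  # someone else takes it over
print(registry.resolve(alice), squatter != alice)
```

If the domain name itself were used as the identifier, the second registration would silently take over the first owner's identity.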

Posted by rsemancik at 6:42 PM in Identity
Friday, 11 December 2009

There are lots and lots of sites that hoard and gather content on the web. Sites that offer to maintain your photo album, video collection, bookmarks and whatnot. Each and every such site tries to gather a community of its own. How can you tell apart the sites that are worth your attention from the sites that would be just a plain waste of time? How can you see whether there is a healthy community or just a bunch of uninteresting losers?

I have figured out a three-second test that seems to work quite universally. Just go to the site and use the search input field to search for some controversial topic. I usually search for "nude". If the search results are just porn or a horde of flame-infested discussions, the site is an uncontrolled wilderness. Avoid that site. If nothing relevant turns up or you can see just some carefully censored bikini shots, the site is too conservative to be useful or entertaining. Avoid such a site as well. If the search results show a decent selection of artistic nudes or some good texts on nudity, it is worth the time to explore the site further.

Posted by rsemancik at 4:31 PM in misc
Tuesday, 1 December 2009

World Wide Web Architecture, and the REST architectural style as well, deal with resources. Resource is one of the central concepts in the web. Web pages are just representations of resources, resources are identified by URIs, the web is all about resources. But what is a resource? Now, that's a mystery.

The World Wide Web Architecture document provides quite a vague and indirect definition:

By design a URI identifies one resource. We do not limit the scope of what might be a resource. The term "resource" is used in a general sense for whatever might be identified by a URI. It is conventional on the hypertext Web to describe Web pages, images, product catalogs, etc. as “resources”. The distinguishing characteristic of these resources is that all of their essential characteristics can be conveyed in a message. We identify this set as “information resources.” [...] However, our use of the term resource is intentionally more broad. Other things, such as cars and dogs (and, if you've printed this document on physical sheets of paper, the artifact that you are holding in your hand), are resources too. They are not information resources, however, because their essence is not information. Although it is possible to describe a great many things about a car or a dog in a sequence of bits, the sum of those things will invariably be an approximation of the essential character of the resource.
That means that anything can be a resource. Dogs, houses, books, a specific version of a book, a specific paper-based copy of a book, a photograph of the book, files containing data scanned from that book in pixmap format, data containing the content of that book in ASCII format, the HTML-formatted content of that book, the web page that contains the HTML-formatted content of that book and even a web page describing that book in an electronic shop - all of these could be resources. But wait, isn't a web page containing the HTML-formatted content of the book in fact a resource representation? Yes, it is. And many of the objects and concepts mentioned above may be resource representations. And they may, at the same time, themselves be resources. In fact there seems to be no difference between a representation and a resource (maybe except for non-information resources). The world of the Web is not black-and-white with abstract resources and concrete representations (as seems to be at least partially assumed by REST). There are many shades of gray between abstract and concrete. And maybe pure abstractness and pure concreteness are just theoretical extremes that cannot be reached in practice. Such fuzziness of meaning is one of the most difficult parts of Web architecture to understand.

However, allowing real-world things to be resources creates an awful lot of problems. The panorama of these issues starts with the problem of who is authorized to assign a URI to the star known as "Sirius" (as it obviously can be a resource and it should have a single URI). Then it goes through the problem of completeness, as it is quite difficult to imagine that an "information resource" would capture all aspects, characteristics, features and (potentially conflicting) viewpoints that concern a specific real-world thing. Many more problems follow and I'm sure we do not yet see most of them. I've tried to capture the obvious problems in my paper. The Semantic Web activity is trying to address some of these issues, but so far it seems that the result is to make the problems machine-processable and efficiently distributable at Internet scale. I have seen no real solution so far.

Therefore I have proposed to limit the definition of a resource to include only so-called "information resources". Information resources may indirectly refer to real-world things and concepts, but the Web in fact does not need to (and cannot) deal with the real world directly.

Posted by rsemancik at 2:41 PM in Web architecture