Proper Case Format Provider for IdM

Today I am posting yet another custom format provider suitable for Identity Management projects. This provider comes with a little story; I hope you’ll enjoy it.

Cult of personality

The majority of identity management projects are centered on a person’s identity as opposed to on a computer, a printer or any other inanimate object. Therefore administrators are quite often presented with familiar sets of biographical data attributes such as first and last names.

By the nature of an identity management project, any change made to synchronized data is replicated and propagated to many data-sources; as a result, that data gains greater visibility. At the same time, biographical data is by its nature a matter of personal vanity. The combination of those factors can be very politically charged and can consequently become a serious hindrance to a seemingly purely technical project.

The code that I am about to present was conceived during one of my IdM engagements where I was facing a challenge of reading data from a legacy mainframe system and piping it out to dozens of directories around the enterprise.

"Legacy" is another way of saying old

Mainframes, with their heavy-lifting capabilities, have an important role in the modern enterprise. We call it "legacy". We have a whole industry supporting "legacy" and making it more open and available to current trendy technologies. "Legacy", almost by definition, is resistant to change, and consequently there is a whole other side of the industry that is working against the trend of opening it up; do you remember Y2K projects? Good times! Good times! I think that was the time when the word "legacy" settled into everyday IT vocabulary.

The problem that I personally faced while interacting with "legacy" was rather simple to describe. All of the users’ biographical information in the HR system was stored in all CAPS. (Visualize Homer Simpson’s "DOH!") There is nothing wrong with this if you want to represent SCREAMING in computer writing, but it was not quite acceptable for re-use throughout the whole enterprise. System admins of eDirectory and AD didn’t like the idea of overwriting their carefully formatted-by-hand data. So, Mr. Homer Simpson in AD was stored as SIMPSON, HOMER in the mainframe. Big deal! How hard is it to lower-case the string and capitalize the first letter of each word? Right? Wrong!

Doh!

C’est l’histoire De ‘Monique

As far as I could tell at the time, everything was going very well. My code was lean, easy to understand and performant. However, on closer examination I found a highly positioned person with the last name De’Monique (the name was slightly modified to reflect the essence of the problem). Why was De’Monique a problem? Let’s take a closer look at the algorithm I originally proposed. Take the string "DE’MONIQUE" and lower-case it, which results in "de’monique". Now take the first letter of the string, which is "d" in this case, capitalize it to make it "D", and append the rest of the string to it, which results in "De’monique". Doh!

Mrs. Big Director De’Monique was not very happy about having her name lowered to mere-mortal sloppy formatting by some piece of code chomping strings away somewhere in the guts of the IT machine. With a name like De’Monique she deserved special attention! I also received a protest from a Mr. Van Der Problem, Vice President, whose name became "Van der problem", which is inappropriate for a person of his caliber. As you can imagine, there are plenty of hyphenated names that also suffer from bluntly lowering the string and capitalizing the first character of the entire string. So the opportunity to enhance my code presented itself (once again). I had to write something more clever than a simple first-letter capitalizer.

So the method was born: split the string on spaces and process each piece, then split each piece on apostrophes and hyphens, with a recursive call to the capitalization routine.
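A minimal sketch of that recursive split-and-capitalize idea, written in Python for brevity. It is not the original provider code: the function name `proper_case` and the separator list are assumptions made for illustration.

```python
# Separators assumed from the text: space, straight and curly apostrophe, hyphen.
SEPARATORS = (" ", "'", "’", "-")

def proper_case(value, separators=SEPARATORS):
    """Capitalize the first letter of every component of value.

    The string is split on the first separator and each piece is handed
    back to the same routine with the remaining separators (the recursive
    call). Once no separators are left, the piece itself is capitalized.
    """
    if not separators:
        return value[:1].upper() + value[1:].lower()
    head, rest = separators[0], separators[1:]
    return head.join(proper_case(part, rest) for part in value.split(head))
```

With this sketch "DE’MONIQUE" and "VAN DER PROBLEM" come out with every component capitalized, while a value such as "PHD" is still flattened to "Phd", which is the next problem in the story.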

Going "all-in"

Once I figured out how to split the string into its components and analyze each component individually, I was in a better place. Mrs. De’Monique was happy; Van Der Problem wasn’t problematic any longer. My formatter was a lean and mean formatting machine, ready for primetime. Life was good.

Have I mentioned that ALL VALUES IN THE HR MAINFRAME SYSTEM ARE STORED IN ALL CAPS? They were (and probably still are)! My inner geek was very angry at the person who decided that ALL CAPS was the way to save space on magnetic tape in 1963. So what did I do? In my infinite wisdom I unleashed my mean reformatting algorithm on all values of the HR data-source. I bet you can already feel that there will be a problem with this decision.

Degree of Vanity

I don’t know about you, but personally I am not a big fan of traditional schooling; even thinking about years and years of going to school makes me a little anxious. I can appreciate people who paid their dues and stayed in school for a long time. Perhaps they didn’t really know who they wanted to be when they grew up, or maybe they loved one particular field of study so much that they had to get their PhD. I can see that when you’ve paved your way to the title of "doctor" you want to display it proudly, and you will pick up that phone receiver and call your local help-desk to make absolutely sure that you are displayed in the global address list as Homer J. Simpson, PhD, and no other way.

So let’s take a look at this scenario through the prism of my lean and mean formatter. Remember, I’ve attempted to parse ALL LEGACY CAPITALIZED VALUES with it. By now it could take the string apart into its components and properly capitalize all first characters where and when needed. So in the case of PhD, as expected, it will "flatten it" to "Phd"; Doh! …which will generate that dreaded helpdesk call… Doh!

So how can we fix this? The answer is the mysterious and all-powerful RegEx. I had to write a regular expression "formula" that looks for patterns within the string and determines whether the string contains a predefined acronym, such as PhD, that deserves special attention. Apart from PhD, which is a rather special capitalization case, there are plenty of other capitalization artifacts in the mighty English language. So say hello to Regular Expressions:

Titles

A regular expression string that matches degrees and professional designations so that they can be rendered in all caps. It is meant to be applied case-insensitively and matches:
MVP and MCP; DSC; CNA; CCNA and CCNP; MCSE, MCSA and MCSD; CISM and CISA; DDS; RN; MD and OD; BA and MA; CISSP

(^m(v|c)p.?,?$)|(^dsc.?,?$)|(^cna.?,?$)|(^c{2}n(a|p).?,?$)|(^mcs[ead].?,?$)|(^cis(a|m).?,?$)|(^d{2}s.?,?$)|(^rn.?,?$)|(^(m|o).?d.?,?$)|(^(b|m).?a.?,?$)|(^cis{2}p.?,?$)
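To illustrate how such an expression could be applied, here is a small Python sketch. The pattern is a tidied variant of the expression above (optional literal periods instead of "any character"), not the author's exact string, and `format_title` is a hypothetical helper name.

```python
import re

# Tidied variant of the designation pattern: an optional period and an
# optional trailing comma are allowed after each designation.
TITLES = re.compile(
    r"^(m[vc]p|dsc|cna|ccn[ap]|mcs[ead]|cis[am]|cissp|dds|rn|[mo]\.?d|[bm]\.?a)\.?,?$",
    re.IGNORECASE,
)

def format_title(token):
    """Upper-case the token when it is a known designation, else return it unchanged."""
    return token.upper() if TITLES.match(token) else token
```

A token is expected to be a single word that has already been split off the full value, for example "Mcse" or "Cissp,".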

First and Last Caps

A regular expression string to match PhD or LegD and any variants with periods

(^leg.?d.?,?$)|(^ph.?d.?,?$)
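A sketch of how the "first and last caps" rule might be applied once the pattern matches. It is written in Python, uses literal periods rather than "any character", and the helper name `format_first_last_caps` is an assumption.

```python
import re

# Group 1: the stem ("ph" or "leg"); group 2: an optional period;
# group 3: an optional trailing period and comma.
FIRST_LAST = re.compile(r"^(leg|ph)(\.?)d(\.?,?)$", re.IGNORECASE)

def format_first_last_caps(token):
    """Render PhD/LegD style tokens with the first and last letters capitalized."""
    m = FIRST_LAST.match(token)
    if not m:
        return token
    stem, dot, tail = m.groups()
    return stem.capitalize() + dot + "D" + tail
```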

Roman Soldiers count-off

Roman Soldiers, count off! [ai; ai-ai; ai-ai-ai; ai-vee; vee; vee-ai…] (the joke is dry and wry because it is supposed to be. It’s Monty Python, if anybody cares to know)

As you might guess by now, roman numerals in people’s names present another problem for us when formatting those names. We all appreciate tradition, so when your name is Homer J. Simpson IV it is pronounced "fourth" not "[ai – vee]"; hence you will probably want to see it displayed as "IV" and not "Iv".

So how hard is it to teach my lean and mean format provider to see Roman Numerals? Thankfully it is not too hard with some help from our already mentioned friend RegEx. Here is the RegEx formula that I’ve used to determine Roman Numerals in strings.

Roman Numerals:

A regular expression to match Roman numerals

^((?=[MDCLXVI])((M{0,3})((C[DM])|(D?C{0,3}))?((X[LC])|(L?XX{0,2})|L)?((I[VX])|(V?(II{0,2}))|V)?)),?$
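For illustration, the expression above can be exercised in Python with case-insensitive matching, so that an already "flattened" token such as "Iv" is recognized and restored. The helper name `format_roman` is hypothetical.

```python
import re

# The Roman numeral expression from the text, split over two lines for readability.
ROMAN = re.compile(
    r"^((?=[MDCLXVI])((M{0,3})((C[DM])|(D?C{0,3}))?"
    r"((X[LC])|(L?XX{0,2})|L)?((I[VX])|(V?(II{0,2}))|V)?)),?$",
    re.IGNORECASE,
)

def format_roman(token):
    """Upper-case the token when it is a well-formed Roman numeral."""
    return token.upper() if ROMAN.match(token) else token
```

One caveat worth keeping in mind: short real-world names such as "Li" or "Xi" are also valid Roman numerals (51 and 11), so a rule like this is safer on a suffix position than on every word of a name.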

Thankfully I was not asked to represent number zero in Roman Numerals.

The rise and fall of the ancient clan of MacHinist

In this article I have already mentioned De’French, Van Dutch with Der Germans, and the Romans (more or less); now let’s talk about two other great nations, the Irish and the Scottish. We are already OK with O’Lastname kinds of names; however, other Scottish and Irish patronymic surnames frequently have ‘Mc’ or ‘Mac’ prefixes. (Can you hear the drum-roll sound emerging?)

To illustrate, let’s take a look at "MacDonald", which is one of the best-known surnames of this type that comes to my mind. The problem with that name, as you might already see, comes in the form of double capitalization without any separation between the prefix and the root of the word. MacSimpson will not be very happy if you create the display name Homer J. Macsimpson. It looks wrong, it feels wrong, and therefore it’s just wrong! I would call the help-desk right away! Doh!

I had to come back and seek help in the world of regular expressions once again. The syntax that I used this time was a little more tricky:

McOption:

A regular expression to match the Mc’s and Mac’s of the world, but NOT MCSE, MCSD or MCSA. This expression uses a negative lookahead to rule out those possibilities.

^(ma?c)(?!s[ead]$)((.+))$
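A short Python sketch of how this expression behaves, including the negative lookahead that keeps MCSE, MCSD and MCSA out. The helper name `apply_mc` is an assumption; the pattern is the one shown above, applied case-insensitively.

```python
import re

# (ma?c) captures the prefix; (?!s[ead]$) rejects "mcse", "mcsa" and "mcsd";
# the last group captures the root of the name.
MC = re.compile(r"^(ma?c)(?!s[ead]$)((.+))$", re.IGNORECASE)

def apply_mc(token):
    """Capitalize both the Mc/Mac prefix and the first letter of the root."""
    m = MC.match(token)
    if not m:
        return token
    prefix, root = m.group(1), m.group(2)
    return prefix.capitalize() + root[:1].upper() + root[1:].lower()
```

Note that the expression knows nothing about surnames: any word that starts with "mac" or "mc" gets the treatment.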

 

MacSimpson case was successfully resolved! However, do you remember that I was running this algorithm on ALL CAPITALIZED DATA FROM LEGACY HR? Well, it is about to bite me in the rear.

Observe:

First Name: HOMER -> Homer
Last Name: SIMPSON -> Simpson
Title: MACHINIST -> MacHinist

I was looking at my data and wondering why I was seeing so many people with the same last name. All those MacHinists! It must be a whole clan that moved in to work here (I began to wonder how the tartan of the MacHinist clan looked); and why were all those last names listed in the "title" field? W-a-ait a minute… Doh!

This problem almost entirely killed the whole re-formatting idea. When working with a pool of thousands and thousands of last names (or applying the same formatting rule to all strings) you will have last names from all corners of the planet. Spotting Irish/Scottish last names and distinguishing them from any other last names (or random words) that could legitimately start with Mac or Mc (like "machinist") is mission impossible.

There is no spoon

The solution for this problem is not very simple; frankly, I could not solve it with the IFormatProvider interface implementation alone. What I did was take a two-pronged approach.

Prong number one:

The proper case format provider was adjusted to have the "McOption", which can be turned on or off. That allowed me to choose between ways of capitalizing certain strings: the "title" would not have the McOption turned on and the "last name" would. Overall, the introduction of the McOption provided a greater degree of flexibility. After the McOption was separated out, the "proper case" capitalization worked very well, with very few exceptions.
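The effect of a per-attribute switch can be sketched as follows. This is an illustration in Python, not the provider's actual interface: the `MC_OPTION` table, the attribute names and `format_attribute` are all hypothetical.

```python
import re

MC = re.compile(r"^(ma?c)(?!s[ead]$)(.+)$", re.IGNORECASE)

# Hypothetical per-attribute settings: only surnames get the Mc/Mac rule.
MC_OPTION = {"lastName": True, "title": False}

def format_attribute(name, value):
    """Proper-case each word, applying the Mc/Mac rule only where enabled."""
    words = []
    for word in value.split(" "):
        m = MC.match(word) if MC_OPTION.get(name, False) else None
        if m:
            root = m.group(2)
            words.append(m.group(1).capitalize() + root[:1].upper() + root[1:].lower())
        else:
            words.append(word[:1].upper() + word[1:].lower())
    return " ".join(words)
```

With the option off for "title", MACHINIST comes out as "Machinist"; with it on for "lastName", MACDONALD becomes "MacDonald", and MACHADO still becomes "MacHado", which is what the second prong has to deal with.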

Prong number two:

I found a handful of surnames that had to be reverted to their manually formed capitalization. An example is the last name "Machado". With the McOption turned on, the last name Machado would be capitalized as MacHado, which is not a desirable result. My first attempt to resolve this conundrum was an expansion of the regex syntax, and then I looked into creating an exclusion list.

At this moment I remembered: "Do not try to bend the spoon. That’s impossible. Instead only try to realize the truth. There is no spoon." So once I realized that I really couldn’t predict all possible exceptions and formulate them in a regular expression inside an IFormatProvider, my identity management hat went on. I hate the answer "no". In fact my mantra is: the answer to all technical questions is "yes"; the real question is "how much time do you want to spend?"

By now, with the knowledge that my mean and lean ProperCase formatter was not "The One", I decided to use a pure IdM solution to solve this problem. I created a data-source that contained a handful of people with exceptions in the spelling of their names. In fact, I overrode the entire "display name" attribute. People like Machado, DeBeers, and others with irregular capitalization of their names would be manually added to that data-source. The attribute flow precedence for the name was set so that the "Capitalization Exceptions" management agent has the highest priority. Whenever a record is created in the "Capitalization Exceptions" MA, the user is joined on the user ID and the value of "display name" flows on top of the auto-formatted value, effectively overwriting it with its hand-crafted equivalent. That not only allowed me to counterweight all the mis-formatting that I had spotted by then, but also addressed the issue of future unknowns and extensibility.
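In the real deployment this override is configuration in the synchronization engine rather than code, but the precedence rule itself is simple enough to sketch. The Python below is only a model of the idea; the `EXCEPTIONS` table, the user IDs and the names in it are invented for illustration.

```python
# Hypothetical contents of the "Capitalization Exceptions" data-source,
# keyed by the user ID that records are joined on.
EXCEPTIONS = {
    "u1001": "Maria Machado",
    "u1002": "Jan DeBeers",
}

def display_name(user_id, auto_formatted):
    """Return the hand-crafted value when one exists, else the auto-formatted one."""
    return EXCEPTIONS.get(user_id, auto_formatted)
```

The useful property is that new exceptions are data, not code: adding a record changes the result without touching the formatter.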

Ladies, gentlemen, if you are still reading this post you are more than deserving to download the LafiProperCaseFormatProvider and use it in any way you want. I do appreciate your attention.

CodeProject: Lost And Found Proper Case format Provider: http://www.codeproject.com/KB/string/ProperCaseFormatProvider.aspx

Happy coding!


Dynamic Organizational Unit provisioning with ILM 2007/MIIS 2003

System administrators often face the task of creating an OU structure in corporate LDAP directories, such as Active Directory, ADAM/AD LDS, OpenLDAP, eDirectory, etc. In an organization where the administrator is asked to place a user account object in the OU corresponding to the user’s department, title or any other dynamically calculated container based on the user’s attributes, (s)he must know (and therefore hardcode) the values of the target containers/organizational units in the connected LDAP directory in question.

The MIIS 2003/ILM 2007 developer reference is rich with examples of placing a user account within a pre-defined OU based on the OU’s name. In the event that the parent OU is not available, the administrator is expected to create the organizational unit object manually. At the same time, should the organization extend its list of departments (and therefore the list of corresponding OUs), the provisioning code will have to be augmented to include the new values (paths) and the provisioning/de-provisioning business logic for the newly added target OUs.

To avoid re-compiling the provisioning code for every adjustment in the organizational structure of an enterprise, the administrator could implement a mechanism that creates parent organizational units dynamically, based on the attribute values of the user object in the Metaverse.
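The core of such a mechanism is working out which parent containers are missing for a given target DN. The Python below is a language-neutral sketch of that step only; it does not use the MIIS/ILM API, the function name `missing_parent_ous` is hypothetical, and the DN parsing is deliberately naive (it assumes no escaped commas).

```python
def missing_parent_ous(target_dn, existing_dns):
    """List the ancestor OUs of target_dn that do not exist yet, outermost first."""
    parts = target_dn.split(",")  # naive split: assumes no escaped commas in the DN
    missing = []
    for i in range(1, len(parts)):
        ancestor = ",".join(parts[i:])
        is_ou = parts[i].strip().upper().startswith("OU=")
        if is_ou and ancestor not in existing_dns:
            missing.append(ancestor)
    # Parents must be created before their children.
    return list(reversed(missing))
```

For a user destined for "CN=Homer,OU=Safety,OU=Plant,DC=example,DC=com" in a directory that has neither OU, the sketch returns the "OU=Plant" container first and the "OU=Safety" container second.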

This code example also provides a clear path for de-provisioning of the user account in the future. To illustrate the challenge of dynamically provisioning OUs from within the provisioning cycle of the "user" object type, we need to understand the initial provisioning logic for the first user account that encounters a missing parent OU. The code will "detect" that the parent OU is missing and will generate a CSEntry object of the "organizationalUnit" type in the target management agent. Consequently, the organizational unit object will become (and remain) connected to the user (person) MVEntry object. All subsequent attempts to provision any other user objects into the previously dynamically generated OU will be successful.

However, a problem could arise when the "first" user in this dynamically generated OU is ready to be de-provisioned. Since the OU object is still connected to that user object, the de-provisioning routine could de-provision the organizational unit object along with the user object, which would leave all other users provisioned to the same OU without a "parent". To avoid this unwanted condition, the provided code example disconnects the "organizationalUnit" object from the "user" object during the next synchronization cycle of the Sync Engine. It is important to make sure that your configuration is not set to leave disconnectors of the "organizationalUnit" type as "normal disconnector".

Administrators are strongly encouraged to review de-provisioning logic for all types of objects while implementing this dynamic OU provisioning routine.

You can find code for the "Dynamic Organizational Unit provisioning with ILM 2007/MIIS 2003" here

Phone Formatting

In many of my Identity Management (IdM) projects I face the predicament of "dirty data". The term "dirty data" is used to describe incorrect or misleading data residing within a data-source.

Self-service data-sources (such as web portals, phone directories, etc.) are the biggest producers of inconsistently entered data, which is understandable when any user is allowed to modify his/her data manually with few guidelines and little data verification.

There are many different types of user-provided data that an Identity Management professional will face; one of the most common data types whose entry is "outsourced" to the end-user is the user’s phone number(s). In the end, all synchronized data sources could consume that data, which could lead to difficulties in processing if/when applications expect a more consistent data format.

Here in North America we are lucky to have a uniform phone numbering plan (from a programmer’s standpoint), known as the North American Numbering Plan (NANP); NANP makes parsing a phone number relatively easy. This article covers only the North American phone number format and does not attempt to parse any other formats for any other phone systems. Applying this custom format provider directly to other types of phone numbers could produce unpredictable results. However, you can extend this code to process other types of phone numbers by adding methods that recognize the formats specific to your local phone system. A good example would be the French phone system, which is consistent in its numbering rules and can therefore be handled by a format provider relatively easily.

In the library that I’ve posted on Code Project you can see how to clean up a phone number that was not stored in a uniform manner.
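As a rough idea of what such clean-up involves, here is a small Python sketch. It is not the posted library: the function name `format_nanp` and the "(NPA) NXX-XXXX" output layout are assumptions, and extensions are not handled (a value with an extension is simply rejected).

```python
import re

def format_nanp(raw):
    """Normalize a North American number to "(NPA) NXX-XXXX", or return None.

    NANP numbers have ten digits; the area code and the exchange code
    each start with a digit from 2 to 9. A leading country code 1 is dropped.
    """
    digits = re.sub(r"\D", "", raw)
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]
    if not re.fullmatch(r"[2-9]\d{2}[2-9]\d{6}", digits):
        return None
    return f"({digits[:3]}) {digits[3:6]}-{digits[6:]}"
```

Returning None rather than a best guess keeps obviously malformed values from being propagated to every connected directory.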

You can find the Lost and Found Identity – North American Phone Formatter here

Starting my Blog

Ladies and gentlemen,

After years of silence I’ve decided to dedicate some time to my blog. It’s a new and painful experience for me… LOL

I’ll be posting some of my Identity Management related articles here. After years of working in the IdM arena, it’s time for me to post some of my work and make it public.

I’ve already created several projects on CodeProject server and will be referencing some of that work here.