I realized a common analogy would be a useful tool to help non-engineers and non-developers… why not a car? A modern vehicle is a feat of engineering. Rolling down the highway at 70mph means all sorts of vibrations and harmful oscillations must be damped out to prevent the vehicle from destroying itself. Modern vehicles like my truck actually have a LAN-like network called a CAN(Controller Area Network) that connects everything from the transmission to the radio on a single digital bus. A lot of research and development goes into making a vehicle, so how does that compare to software development? Engineering a vehicle has many phases, the early stages being definition, planning, and experimentation; lets focus in on a few of these.

Definition

The first step of any development process should be definition, or creating a vocabulary for the problem space. The goal is to make sure everyone is talking about the same thing. For instance, in web application development, a user and a customer can be different people!

While designing a car, there is driver, passenger, steering wheel, e-brake, and many more. These terms must be used consistently in the development lifecycle. Ambiguities must be sought and destroyed. For instance, is a driver also a passenger? Or perhaps driver and passengers are both occupants? This critical step is glossed over by Agile methodologists that want to jump right into coding.

A great (and free) resource on how to create a business vocabulary is Domain Driven Design Quickly, by Eric Evans. Eric steps through the process of creating an air traffic control system. Nearly an entire chapter is spent defining the actors, nouns, actions, and relationships in the air traffic control ecosystem before introducing any DDD concepts.

If you’re a non-engineer, you should demand your developers create this vocabulary with you. As you’re painfully aware, communicating what you want with developers is difficult. Creating a vocabulary is a communication channel. Do not skip straight into development or design without it!

Planning

When GM had automotive engineers design the Chevy Volt, it’s unlikely they that put a bunch of sticky notes on a board that said, “A user should be able to steer the car in a particular direction,” then ran off willy-nilly and built out the steering rack from modern hydraulic actuators. Later on, in sprint 25, when they got around to building the engine, they realized that a hydraulic steering rack isn’t ideal for an electric car. Turns out, hydraulics require a constant pressure from a turning pump, and an electric vehicle’s motor isn’t turning at red lights.

They didn’t spend 8 years drawing on paper either, then started building the car. Each time they designed a component of the vehicle, they had to change the plans for the rest of the vehicle. Their design was iterative and if you look at the early prototypes, they look nothing like the final product.

Neither extreme is healthy and unfortunately Agile developers tend to want to get started quickly without thinking of the big picture. Old school waterfall developers like to think that once they develop a plan, they should never ever change it, and also think that they can foresee all the problems that would arise with a particular design.

Software engineers should fall somewhere in the middle. After user stories are somewhat complete, a domain model should be developed that shows object attributes and how they relate to the rest of the domain. After the domain design is created, an architecture plan should then be developed that shows how objects stored, processed, and shuffled around the system. Cross cutting concerns, such as logging, monitoring, and auditing should be introduced at this point.

Unfortunately, there is not easy road for non-engineers here. Your developers should have a technical plan, and they should be able to explain it to you in a way you understand. Each iteration the technical plan and model should be updated and changed. If your developers actually keep the same technical plan between iterations, this should be a red flag: No one gets it right the first time. During development, your engineers will encounter business scenarios and technical problems the team simply didn’t think of. Each iteration should be a refinement, or even a roadmap change if you really just “got it wrong.”

Experimentation

As I said earlier, GM didn’t spend 8 years drawing the design out on paper until they thought they solved every possible problem that popped into their heads. After they defined their vocabulary and came up with a basic design, GM experimented with 25 different battery chemistries in real life using various test platforms to validate their design. Turns out, some manufacturers had no idea that their batteries tended to explode.

Research and development is something you can actually write off on your taxes. R&D should be done at the beginning of every iteration, after design, before development. Creating, measuring, and analyzing the performance, productivity, and maintainability of a particular design will help you weed out expensive mistakes. Developers should concentrate on creating specific goals, and communicate effectively to their peers what they did, why they did it, and what happened. A wiki is a great place to store R&D results and later, when creating the tax writeoffs, accounting can show that actual R&D was done.

This is an area that developers love, and business managers don’t understand. Business managers tend to think playing with shiny development toys is a waste of time when ‘real work’ implementing requirements should be done. (If you work for a company like this, I encourage you to quit your job. These types of factories deserve to be talent starved and are not worth working for)

The trick to making sure that real work is done is timeboxing R&D and setting goals. Specific goals should first be defined, then an appropriately sized timebox is allotted proportionately to the experiment. Specific goals might be evaluating one framework over another, or one design over another. Large R&D boxes should be allocated at the beginning of the project, with small ones at the end. Again, the goal here is to avoid waste. Being proactive rather than reactive, while counterintuitive, will only make your project succeed.

…there are things that we now know we don’t know. But there are also unknown unknowns. There are things we do not know we don’t know

-Donald Rumsfeld (An odd person to say such a quote)

The ultimate goal of R&D is to discover the things you don’t know that you don’t know. Discovering failures proactively saves you time and makes developers happy.

Conclusion

The difference between “coding” and “engineering” is that engineering wants to be deterministic and predictable. Coders tend to jump straight into things without giving long-term business goals thought, then jump ship when the waters get rough. The solution is to follow all steps of an engineering lifecycle while producing engineering artifacts such as models and plans. Finally, just like a car, models and plans require maintenance, and should be changed and upgraded frequently during the engineering lifecycle.

In the comments, how about some stories where upfront design was skipped? Also, lets hear some stories where management locked engineers into a failing design!

What is the secret to good architecture? While there are plenty of posts written about this topic on the internet, I found this article on DZone (originally posted here) intriguing.

The article is written by Niklas Schlimm, whom I do not know personally. The article can be summarized with three points:

Three [proposed] Laws of Good Software Architecture

Good architecture is one that enables architects to have a minimum of irrevocable decisions.

To make decisions revocable you need to design for flexibility.

To make use of flexibility one needs to refactor mercilessly.

CYA Engineering?

What I think Niklas is trying to say is: you will get it wrong, and sadly, you’ll get it wrong most of the time, so design an architecture where you can fix your mistakes with a couple of swaps. Is designing an architecture that allows you to revoke your decisions at whim really fair? Does the CEO get the same privilege?

Flexible is not durable. They don’t build make buildings out of rubber. We’ve probably all worked on over-engineered legacy systems that were “flexible.” Most times, good intentions turning into dancing circles around libraries and frameworks in an unholy ritual of abstraction.

You should be refactoring mercilessly, all the time. But do you need 10 layers of abstraction to do this? Do revocable decisions allow you to revoke custom code?

Alternative laws

Lets address the problem here. Code doesn’t age, or stop working, it’s Human’s that forget how things work, why decisions were made, and becomes disenchanted with the past. The input to the system changes, usage levels increase, and people leave the organization. The only real constant is the code itself!

Refuting the laws given with the alternative laws

Good architecture is one that enables architects to have a minimum of irrevocable decisions.
Alternative law #3 says you won’t be able to predict which decisions are irrevocable. Instead of designing an architecture with revocable everything, design the minimum possible architecture that barely scrapes by to get the job done.

Instead of developing your own bash build script, use Maven, and don’t do anything silly with it. “</amvn clean install” and “mvn clean deploy” should be all the commands you need to send your artifact onward.

To make decisions revocable you need to design for flexibility.
Alternative law #1 says you stink at designing for flexibility, so don’t do it. Rather than designing for flexibility, design an exit strategy document for each technology you use. This includes frameworks built in house: how will you exit from those once they become legacy code? (The best way is to abstain from writing them.)

Alternative law #2 also applies here, don’t use obscure technology like a NoSQL database until you actually determine you’ll need a a NoSQL database by profiling. It you’ve truly designed minimum architecture and opted for simplicity, it will be simple to change out your backing store with a different database. It’s good to remember that you are not Google. If you design simple architecture, you’ll have plenty of time later on to switch to NoSQL when it’s needed, rather than rewriting a thousand lines of Redis specific JSON handling code upfront.

To make use of flexibility one needs to refactor mercilessly.
Given Alternative laws 1, 2, and 3, this brings old law #3 into new light. You should refactor mercilessly and eliminate technologies. Gage yourself: How many new technologies did you bring in versus how many exit plans did you execute?

Develop a master engineering architecture plan. Every release should bring your closer to the master design, at which point, you should update your exit plans then update your master architecture plan. What technologies could minimize the architecture? Evaluate them… do they require glue code to be written? What frameworks can we exit from and what technologies aren’t working out? Get rid of them, quickly. What in-house utility projects (aka future “legacy code”) has been written? Open source it, replace them, or better yet, observe that YAGNI has manifested itself.

Conclusion, flame wars

When it doubt, don’t write it. Change your processes so you can use a tool without modification. Don’t dance around frameworks with unnecessary abstraction. Above all else, simplicity is key!

Finally, remember this is an attack on complicated software, not the author of the original post. I’ve enjoyed reading his blog. I look forward to your comments!

And now for something completely different. In a previous post, I said than no encryption algorithm is secure, but only resistive to attack. Well I lied. There is one, and only one, provably secure encryption technique. This technique, while wildly impractical, is a fun exercise, and was actually used to change world history during the Cold War.

So why are all encryption algorithms insecure? Simple, because they key is shorter than the data. Take this plaintext:

I believe that fear of life brings a greater fear of death.

-David Blaine

That’s 59 bytes of text, yet the typical size for an encryption key is usually 128bits, or 16 bytes, or 268.75% larger. That means at some point, a pattern must be used to expand the key over the plaintext. Encryption algorithms are designed to obfuscate patterns in the output, or make their intervals so large that recognition takes massive computation and memory. However, this work at MIT shows that entire ‘queries’ can be ran against fully encrypted data due to patterns existing in the output. My explanation is a gross oversimplification, but the takeaway here is that encrypted data differs significantly from random data.

One way to solve this problem is to make the encryption key larger than our plaintext. In 1917 two Americans, Gilbert Vernam (of Bell Labs) and Joseph Mauborgne (U.S. Army), had the same thought while following up on an earlier idea to secure telegraph systems. They hypothesized that if two machines shared a random sequence of numbers, and were separated by a great distance, they could communicate securely by combining one number from the plaintext, with one number with the random sequence, then at the receiving end, compensate for the random number. The result would be gibberish to an attacker, unless they knew the correct random number.

Later in the 1940s the US and the Soviets proved in separate efforts that this system was impossible to break. However, the system had one catch: if the random sequence was any way statistically predictable, it simply wasn’t secure. Hence the name “One Time Pad”. If the ‘Pad’ (originally a bunch of numbers typed out on a notepad) was used more than ‘One Time’, it could be compared against other messages, giving the output statistical value.

So how does this work in practice? First sample some entropy. My sample is: 5, 2, 8, 3, 6. The numbers in the sequence have no statistical meaning compared to another sample of entropy. If I shift my set down (4, 1, 7, 2, 5) or up (6, 3, 9, 4, 7), it’s still statistically meaningless, unless you knew something about the original data.

Unlike shifting a random sequence, the letters are still statistically significant. In English, ‘e’ is the most letter. This reflects in the plaintext, where ‘e’ occurs 5 times. In the obfuscated quote, ‘d’ appears 5 times. An attacker could easily surmise that ‘d == e’

Now, all the letters are statistically random. Since ‘^’ is now the most common character in the output, an attacker might guess ‘^ == e’ but would be incorrect. Without knowledge of the pad or plaintext, an attacker, even with unlimited computing power and time, cannot guess the actual message. Furthermore, one can supply a false pad and get a completely different message. For instance, applying this pad: 6, -4, -21, -4, 3, -1, -11, -20, -17, -5, 2, -9, -81, -1, -5, 2, -3, 21, -15, 4 to the ciphertext yields an entirely new quote:

Antiferromagnetically

-John Adams (on magnetism)

In the real world, XOR is used rather than adding/subtracting for a myriad of reasons, the two biggest being: it happens to work with binary data, and it covers the statistical significance of the alphabet up nicely. The real world also requires that pads be kept secret and only used once.

In conclusion, the One Time Pad is the only provably secure encryption algorithm, provided than the Pad is completely unpredictable. But with that security comes many impracticalities: distribution of pads, pad length, and pad destruction.

Look forward to a few more cryptography posts, then we will resume course on development topics!

It’s been a few weeks since my last post. Things have shifted around for me career-wise, and while I’m enjoying my new position, I’m transitioning out of my old role. Sometimes it feels like I’ve worked two jobs when I get home! Anyway, I’m going to take a break from the cryptography series and talk about something much more practical.

At my job, I’m finding that a large percentage of the questions that come to my desk are “Hey this thing you did is broke,” or “Hey, this thing you worked with is on the fritz.” 99% of the time, I’d have no idea why the behavior they were asking about was occurring! However, I think I’ve got a great problem solving methodology, which I attribute to being key in my success. Maybe some agile writer can take the concepts here and we can develop a formal methodology for solving problems 🙂

Preventative Measures

As professionals, we should accept that production problems are unavoidable. You will create defects in your code, and no amount of testing will ever weed them all out. Code is rushed, servers crash, disks fail, and datacenters overheat. Accepting the fact that production problems occur is very important, but most importantly you need to plan for them. As one of my favorite financial advisor says, “It’s not a matter of if these events will happen; it’s simply a matter of when they will happen.” Plan your week/month with hours available to investigate strange behavior, crashes or user complaints.

Your developers must be on call. I’ve discovered there is a strong correlation between the cleanliness of a developer’s code, and whether or not he had to be on call at some point in his career. People that don’t have to shovel their own shiztpoo do not have stake in preventing problems! Here’s the thing though, the on-call phone is for emergencies. There should be an understanding that if the issue isn’t happening right now, then it doesn’t need to be worked right now. Too often a pager is used to extend an employee’s work day! If you get to the point where you are looking at code, you should probably agree to throw up a 500 page and sick the whole team on the problem in the morning. Furthermore, if a user calls in the middle of the night to report an issue that they failed to report hours ago, chances are it can wait till daylight. It’s important to protect the on call phone, save it for when the meadow is truly on fire!

Use a bug tracking system! This is so obvious it amazes me how many companies do not have this. When events occur, update the bug report with an occurrence. It’s very important that you keep record of the event… When you suggest to management they give you time to fix an issue, having a record of the hours spent doing crisis diagnosis is an invaluable resource!

Keep an application inventory. Have your developers create a wiki entry for every application, that has the following information:

What configuration files are need to run

Which databases are connected to

What tables are read

What EIS systems are interacted with

What names of queues/topics are used

What usernames are used to interact with above systems

Where the logs are located

Known problems!

Do not underestimate how vital this information is! In your SDLC, you should have a step where the above information is updated!

Ok, I’m on call. The phone is ringing. What do I do?

Your first step should always be to get an accurate description of the symptoms. You then must narrow the problem scope to either being able: to reproduce it on demand, or predict when it will occur next. For instance, if you have a B2B webservice and the consumer is complaining your sending SOAP faults, your first move should be to have them dump the XML request that is causing the issue. If an end user is complaining about an error message when they click a link, you should have them copy the link and take screenshots. I call this “narrowing the problem.” This is _the_ most important step. It is not optional, you should not proceed to work on a system unless you know what the problem is! You could easily worsen the situation by making arbitrary changes, and the likelihood of doing something random to solve an unknown issue is abysmal.

What if the end user causing the problem is not available, or the person on the other end is unwilling to provide the information you’re requesting? I think the best tactic at this point to ask them how they would like you to proceed. Let them finish completely without interrupting. Perhaps they were onto something, or had an alternative. Or perhaps they’re in left field, and if it leans towards left field, explain to them that you cannot solve a problem you don’t understand and can’t recreate. Don’t be afraid to stick up for yourself, it takes information to solve problems!

Ok, lets get crackin. It’s late and I want to go back to sleep

After you’ve narrowed the problem to a very very specific story or use case, it’s time to get working. First, copy the logs of the production server onto your local computer. If the logs roll while you’re troubleshooting, you may never find root cause. Next, copy any data related to the problem out of the database and store in text format with the log files. This may mean taking a snapshot of configuration tables of an application, the user’s profile rows, and maybe the data they’re trying to access. If you think it’s related, copy it! If you think it’s unrelated, copy it! Finally, note the version of the application, the bugfix level of the server, and the version number of the libraries available on the server. Whew!

The next thing is to get a pencil and paper. You should write down ever step you took and the time, and more importantly why you did it. This must be done as you perform the steps, but it doesn’t have to be verbose. For example:

“3:12am Noticed the logfile contained many a transaction timeout, meaning the connection pool may be locked. Restarting the application server”

There are three parts to this entry, the first is the observation, the second is the analysis, and the third is the action to correct.

So how do you find the problem?

It’s time to use that application inventory that you did way back in the preventative measures section. Lets say the log is reporting that the user cannot be found in the database. Since you’ve already narrowed the problem, simply refer to your inventory and connect as the application, then proceed to locate the user. If you can’t find them, then problem solved (for now)! Your inventory will tell you exactly what table and database to retrieve this information from.

But what if you’re not finding anything? Well… you need to poke around some more. Filter through your logs files and database entries. Do you see anything that looks out of place? Since you’ve already narrowed the issue down and reproduce it, can you try reproducing it using a different user? How about if you tail the log in realtime while the end user create the problem again? Can you enable debug logging in the frameworks (Spring, Hibernate, etc)? Can you reproduce the problem in a non-production environment? How about remote enabling remote debugging and stepping through as the system executes? How about cleaning up temporary files or deleting JSP compilation directories? Has the system ran out of disk space? Finally, does a good old fashioned restart fix the problem?

The hangover

So you were on the phone for an hour or two last night, what do you do the next day besides show up late? Well before you leave your house, take that pencil/paper with you. Open your bug tracking system, or send an email to the owners and include the original complaint, the logs, the database entries, and what finally fixed the problem. Transcribe your quick notes into the ticket. It’s best if this is done the next day, before your forget how you fixed the problem and why you made the decisions you did.

If you followed every step in this entry, you should have no problem giving management a thorough report of what was wrong and how it was fixed. This information is critical to mend over any rough spots with your customers. It can also be used to CYA if any flak comes your way! Finally, I hope this will make you a problem solving rockstar, ready to write better, simpler code (with tons of debug statements using SLF4J).

This article is one of many in my Strong Cryptography series. Today we’ll dive into some deeper to some Hashing topics. Also I’m taking suggestions for subjects in my posts… Last time it was US Founding Fathers, this week it’s tragedies.

We learned in our last article what a hash function is and how for every input, there is exactly one corresponding output. We’re going to apply that concept to two commons situations where confidentiality is needed.

Common Situation #1: A situation in Verona

Lets say you have Romeo who needs to email the following message to Mercutio: “I dreamt a dream tonight.” Unfortunately for Romeo, all the ISPs in Verona are owned by the Montague family (Not much better than Comcast). Romeo doesn’t mind if the Montagues can read his message, but he certainly has an issue of they change his message. How could Romeo be sure his message was delivered without being modified by the Montague family?

Well you could run the message through a hash function and attached the hash to the message. The Montagues however, could simply modify the message, and attached the new hash value. To get around that, I suppose Romeo could calculate the hash, then tell Mercutio in private what the hash value was. However, this would defeat the purpose of emailing him, since in private he could just tell Mercutio the contents of the message.

Any ideas?

Introducing the new MAC (not what you think)

Fortunately, there is an easy solution: Message Authentication Codes or MAC. A MAC function is a ‘keyed hash function’, so you could call it a cousin of a normal hash function. In fact, MAC functions and hash functions are nearly interchangeable! In the last article, we covered the properties of a hash function. So what makes a MAC function different from a hash function?

You must posses a secret key can produce a hash value.

A MAC function shouldn’t reveal ‘hints’ about the secret key, even if an attacker chooses the input.

Because you must posses a secret key to produce the hash value, they are often called “Keyed Hash Functions.” So lets run the quote “The lady doth protest too much, methinks” through a MAC function:

If you use a different key (lets change the last byte to 0x0d) you’ll get a completely different hash:

cDdJwEBm9qIni9A7QfIQ1e9G8qo=

So how does apply to Romeo and Mercutio? First, they meet in private and create a secret key. Both Romeo and Mercutio will know the secret key. Next, when they want to send an email, they run the message through the MAC function and produce a signature. Finally, the recipient runs the message through the same MAC function, using the same key. It the calculated signature matches the signature on the message, they can be assured the message was not tampered with during transmission.

A MAC is actually just a wrapper

Take note of the following line in the above implementation:

Mac mac = Mac.getInstance("HMACSHA1");

You may remember from the last article that SHA1 is a fairly standard hash algorithm. Well here’s the whole story: a MAC function is just a wrapper around a standard hash function. The HMAC part of HMACSHA1 is the method of wrapping. There are other MAC implementations, such as VMAC, but they are less commonly used. VMAC performs a little better than HMAC. Since HMAC and VMAC are just wrappers, you can take almost any hash function and turn them into a keyed hash function. Using SHA1’s older brothers, we can run the same quote through the program:

Common Situation #2: Revisiting storing passwords in a database

If you recall in the last article, we decided on the following table structure:

USERNAME

PASSWORD_HASH

SALT

tjefferson43

HiE0AZhRWAs6Mmd7dVqppM1WtJc=

WTg68cVxPI

paulrevere1776

aktJ/0cn69Y41vssNfZDHY1CsdE=

sWVUaTGa6e

Lets add one additional column:

USERNAME

PASSWORD_HASH

SALT

MAC_KEY

tjefferson43

RFu51fVI/0y8gmPXkjo0Op9FRHU=

WTg68cVxPI

KvzpFe690vsVcW1jLqQ1vg==

paulrevere1776

SmZedewg1JD7kRhndEW2cRmcyGw=

sWVUaTGa6e

FgIpIDFi9kJgXvkp44ua4Q==

So what did we do here? Instead of using straight sha-1 function to store a representation of the user’s password, we now use HMACSHA1. What does this do for you? Not much. The better question is what does it to for hackers… It makes their life extremely difficult!

A rainbow attack uses a multi-terrabyte database to do a reverse lookup on our hash algorithm. The database is generated ahead of time, possibly by cooperating parties and thousands of separate computers to divide up the workload. Even with the combined effort, this still takes months to generate useful tables. However, with the finished rainbow table, an attacker could lookup a hash function output to get an arbitrary input that produces the same hash as the user’s real password. Yikes! The nice thing is, using HMACSHA1 makes this infeasible for a hacker. Because we have keyed our hash algorithm differently for every user’s password, they would have to collectively regenerate the rainbow table for every individual entry, making the rainbow very very very difficult in practice!

A brute force attack is related to the rainbow attack. A brute force attack tries random combinations of letters and numbers, or even common english words until it finds an input that produces the correct output. Unfortunately, HMACSHA1 provides little additional security against brute force attacks. However, because they require powerful computing systems to run, they are less common. There are only two ways to defend against brute force attacks:

Enforce good polices around passwords:

12 characters

No english words

at least 2 numbers

at least 2 symbols

at least 2 uppercase letters

Use SHA-512 or SHA-256 instead of SHA-1.

If you’ve enjoyed this article, all that I ask if you vote me up on DZone and HackerNews. Until next time, thank you for reading!

This article is one of many in my Strong Cryptography series. Today we’ll be covering one of the most useful Strong Cryptography primitives: The Hash function. These little beasts are amazing and have a multitudes of uses!

So, what is a hash function? Glad you asked, lets quote the unabated Wikipedia:

A hash function is any well-defined procedure or mathematical function that converts a large, possibly variable-sized amount of data into a small datum, usually a single integer that may serve as an index to an array (cf. associative array). The values returned by a hash function are called hash values, hash codes, hash sums, checksums or simply hashes.

Sigh, Wikipedia used to have nice user-friendly definitions for common folk, since that’s not the case, let me give my own definition:

A hash function takes some input bytes, then calculates a fixed value that identifies that sequence of input bytes. Sorta like a checksum… Wait, a checksum IS a hash function! Cool!

-Jonathan on cryptography.

A (very bad) hash function could be as simple as “divide everything by two until you have one digit left, rounding up as you go along.” We could this function ƒ (ƒ stands for ƒailure of a hash function), and so ƒ(48) = 6 or ƒ(451) = 8.

A simple way to use this function ƒ is to protect a website. When a user creates an account, they choose a number as password (p). During the account creation process we calculate a new number (c = ƒ(p)) and store it for the user. We end up with (c) number, 0-9, in a database for each user.

To check a supplied password (s) against the database, we simply evaluate if (c ==ƒ(s)).

This hash function ƒ isn’t really very useful in the real world, as our keyspace is only 9 possibilities, it’s scales horribly, and it’s trivial to reverse engineer, but if we substitute fail function ƒ for something better, we end up with a secure system!

Technical Properties of an Ideal Hash Algorithm

It should represent a true many-to-one function. It should take an input, and produce an identity output that can be reproduced exactly every time. For every plaintext, there is exactly one hash value. For every hash value, there are infinite plaintexts.

The input plaintext can be any length. Yes, but see the next point:

The output hash value is always the same size, meaning it’s fixed length. So if you gave the algorithm a plaintext of 50 bytes or 765kbytes, the length of the hash value (the output) would as be say, 16 bytes. The exact length of the output depends on the algorithm design.

The always produce the same hash value (output) for the same plaintext (input). So if you hash the quote, “I go on the principle that a public debt is a public curse” you should get the value, vjPWnnLiB0BLrqUuJjtpEM5KClg= every time. This is called identifying, fingerprinting, or hashing the plaintext.

A small change in the plaintext will result in a dramatic change to the hash value. So if you change to Madison’s quote to “I go on the principle that a public debt is not a public curse” the hash value is now: g+8o7vlQAGjvlst+XsOEwIzF0Qc=.

Permanent and “one way”, meaning that the hashValue cannot be reversed to plaintext.

Given a particular hashValue, it is extremely difficult to find a plaintext that will produce that desired hashValue. These last two points run together… Basically, our function ƒ is far from ideal. Lets say you have (p = 5). It’s very easy to calculate an input (s) that would produce (5 = ƒ(s)). You could use any of (s = { 10, 20, 40} ). An ideal function should make this extremely difficult!!!

SHA1, the standard hash algorithm

SHA-1 (Secure Hash Algorithm – 1) is a hash algorithm that is quite complex, and is known to be secure at the time of this writing, for most applications (the paranoid should use SHA-256 or SHA-512). We can supply any length plaintext to SHA-1 and receive a consistent 20 byte hash value in return. Here’s how to do this is Ruby:

If we make a slight change to the plaintext, lets say: “Give me liberty, or let me eat cake!“, we will get a dramatically different output:

SHA-1(Base64) : M2vPzwTPc7ALM+OkiGAwCkS1DY4=

Actually, just changing the case of the first character “give me liberty, or give me death!” will produce a completely different hashValue:

SHA-1(Base64) : g1UFdWJfXWfkIVz42uLLxrJv58g=

Hash Algorithm Input is indeed infinte

Remember how we said the input to the hash function can be infinite? How about the entire text of The Declaration of Independence (12kb)? If you upload the file to the online calculator we’ve been using, the result is still 20 bytes:

Real World Usage of Hash Algorithms, Scenario 1: Storing account passwords in a database in a secure manner

You’ve been tasked with creating a website for the American colonial resistance. The British are abound and are randomly inspecting everyones’ databases for suspicious activity. The Founding Fathers have an unfortunate habit of choosing patriotic passwords like “I<3America”. How can mask passwords from British patrols so the secrets of who is loyal to The Crown, and who is a revolutionary is kept secret?

The best thing to do is created a ‘salted password table’ like this:

USERNAME

PASSWORD_HASH

SALT

tjefferson43

HiE0AZhRWAs6Mmd7dVqppM1WtJc=

WTg68cVxPI

What we will do is store the password hash, plus some random text called a salt (to add entropy to the user’s password). So Thomas Jefferson’s original password: I<3America now becomes I<3AmericaWTg68cVxPI and is hashed permanently to: HiE0AZhRWAs6Mmd7dVqppM1WtJc=.

So how do you process a login, now that the plaintext is gone? You perform the same calculation: Take the username from the login form, and lookup the salt. Then take the plaintext password from the login form and concatenated the salt from the database to reform the original string. Hash that string, and if matches the database hash, you know the correct password was supplied! Jefferson’s original patriotic password is completely masked, so those pesky Tories will never know his true allegiance! Yarr!

But what if another patriot (Paul Revere) chooses the exact same password, but is caught by the British? Couldn’t the British Regulars then just infer that anyone with the same password hash is also a patriot (and be hung)?

Because we added salt (entropy) to the plaintext password, we’re guaranteed to never have the same hash twice in our database. Lets say Paul Revere creates an account with the same password: I<3America, but we followed the required practice of using SecureRandom to create a new salt:

USERNAME

PASSWORD_HASH

SALT

tjefferson43

HiE0AZhRWAs6Mmd7dVqppM1WtJc=

WTg68cVxPI

paulrevere1776

aktJ/0cn69Y41vssNfZDHY1CsdE=

sWVUaTGa6e

So Paul Revere’s password I<3America becomesI<3AmericasWVUaTGa6e, which is hashed to: aktJ/0cn69Y41vssNfZDHY1CsdE= and stored. Later, when Paul Revere was caught and was revealed to be a patriot, Jefferson’s secret is kept safe by salting and hashing.

To summarize the benefits:

A database breach means your user accounts are still secure.

It’s impossible to tell if two people have the same password.

There are no restrictions to the length of a user’s password.

Your database column can be a fixed length instead of a varchar.

SHA1 was created by people who are smarter than you (and I). You can deflate your ego and leverage their brains.

Sidenote: The salt column is extremely important! Never ever, store, or send, a hash without a salt! Furthermore, the salt value must be securely generated with a random generator. You must use your language’s SecureRandom generator! You’ve been warned on these two points: they are CRITICALLY important (I can’t stress this enough)!

Real World Usage of Hash Algorithms, Scenario 2: Noah Webster

Yes, Noah Webster was an American revolutionary: He was a member of Connecticut militia, an editor of the Federalist newspaper, the author of American copyright, among many other things. He went on later in life to write some dictionary thing, but the important part here is he was a real hard ass when it came to teaching.

So say Noah Webster has hired you to write a program to test his students spelling. Because he so adamantly believed in spelling correctly, it’s important each student get 100%. Knowing this is a bit unrealistic though, he will allow the students to retake the test as many times as they want until they get 100%. The test consists of Mr. Webster dictating a small passage of text to the class and having them enter it correctly into your program. The program should tell them whether they spelled everything correctly. This will allow them to figure out the correct answer before turning in their final answer.

Another important piece of American history was that Noah Webster was a closet Ruby fan; you have to write this in Ruby. This presents a problem, because the students will be running your program, you can’t just embed the correct answer in your program, or they could look at the source code.

By using a hash, you can embed the hash of the correct text in the program and compare the students input to the hash.

Mr. Webster begins the test by saying to his students:

Before a standing army can rule, the people must be disarmed; as they are in almost every kingdom of Europe. The supreme power in America cannot enforce unjust laws by the sword; because the whole body of the people are armed, and constitute a force superior to any bands of regular troops that can be, on any pretense, raised in the United States.

The correct hash is: 6+IHAAJ1i0KmoPaowU1i0xSjcO4=

Knowing what you know now, can you finish the program for Mr. Webster? Post your answer in the comments!

One thing that amazes me is that the most developers are not familiar with strong cryptography. In my career, I’ve seen all sort of mistakes that lead to leaked data, guessable passwords, unfortunate disclosures, and worse. The nice thing is, you don’t have to understand the ridiculously complex math behind the algorithms, you only have to know the rules for using them correctly. By the end of this series, my goal is to de-mystify the magic, so you can start using the primitives in your code right away!

But first, when I say Strong Cryptography, what the hell am I referring to anyway?

One thing I’ve seen repeatedly done is that developer ‘invents’ a cryptography scheme for a particular purpose. Here’s the thing, cryptography is thousands of years old. If you’ve ever ‘invented’ your own way to ‘encrypt’ data, chances are you’ve just re-invented something that has been discovered thousandsof years ago. If you want to avoid the mistakes that WEP made with wireless, Microsoft did with the XBox, or Sony made with the PS3, this blog series should help you avoid embarrassment, AND give you something impressive to say at the next cocktail party.

Finally, I just wanted to mention this is actually a very personal subject that I have a long history with. I found my first need for cryptography was passing notes to my friends as we played “Spies” in the neighborhood and needed to keep the locations of our secret forts safe. Unfortunately, my single letter substitution cipher must have been broken by some whiz kid as our treehouse was destroyed that summer… After reading Alvin’s Secret Code, we then created 2-3 sets of Caesar wheels and never lost a secret fort again!