Saturday, July 30, 2011

Well, the experiment seems to be a success, Igor. The IPN endpoint has been in production and functioning for a couple of weeks now. Messages are coming from PayPal, hitting the distributor, and reliably being passed on to the IPN Endpoints. Not a single missed message. We have two of them running now, one for each subscription system.

We've had more trouble with the endpoints being unable to handle the extra streams, frankly. (Don't cross the streams, even though you can) The IPN protocol is whatever the opposite of "stateful" is, so they already had enough trouble trying to figure out what each incoming message was for, and what to do with it. Although the more specific the trigger (like 'Affiliate Program' software that only cares about the sign-up, not the subscription process) the better it works. Fortunately if things go weird, we just turn off the spigot to that IPN Endpoint, or apply some filters.

The only thing close to a severe 'bug' I encountered was not paying enough attention to the character encoding. Here are the two crucial points:

PayPal sends IPN messages in charset "windows-1252" by default.

Google App Engine Java programs expect "UTF-8".

This wasn't really apparent until we had a few payments come in from people in Europe and Asia, with their hoarded supplies of umlauts. If you're wondering 'Why the difference?' it's because PayPal is apparently still stuck in 2002. I don't know.

The incompatibility doesn't seem like much, but in the GAE Servlet environment you need to call setCharacterEncoding() before you call getParameter() for the first time. Since you don't know what the encoding is until after you read the 'charset' parameter, we have a catch-22.
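A minimal illustration of the mismatch itself, outside any servlet container (the name "Müller" is just an invented example):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetMismatch {
    // What happens when getParameter() decodes before the encoding is set:
    // windows-1252 bytes get read as if they were UTF-8.
    public static String misdecode(String original) {
        byte[] wire = original.getBytes(Charset.forName("windows-1252"));
        return new String(wire, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // 0xFC (u-umlaut in windows-1252) is not a valid UTF-8 sequence,
        // so the decoder substitutes the replacement character U+FFFD.
        System.out.println(misdecode("M\u00FCller"));
    }
}
```

The umlaut never survives the trip: it comes out as U+FFFD, and the original byte is gone for good.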

The 'correct' solution to this is to pull apart the input stream ourselves (thus duplicating all the Servlet code which already does this for us) in order to determine the original encoding. Then set the encoding format on the request object, and then re-read the data parameters as normal. Ugh.

However, if you try this, you will run into the next issue: the verification callback to PayPal must exactly match the message you were sent. If you translate the incoming message into UTF-8 before distribution, the IPN endpoints won't even know what the original encoding was, and therefore how to send the message back.

The verification will fail, and PayPal will keep resending the message until timeout. Bad.
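PayPal's documented verification handshake is to POST the message back with cmd=_notify-validate added. A sketch of building that echo from the raw request bytes, so the original encoding never has to be known (the helper name is mine):

```java
import java.nio.charset.StandardCharsets;

// Builds the verification POST body from the raw bytes PayPal sent, so
// the echo matches the original message byte-for-byte regardless of its
// character encoding.
public class IpnVerifier {
    public static byte[] verificationBody(byte[] rawIpnBody) {
        byte[] prefix = "cmd=_notify-validate&".getBytes(StandardCharsets.US_ASCII);
        byte[] body = new byte[prefix.length + rawIpnBody.length];
        System.arraycopy(prefix, 0, body, 0, prefix.length);
        System.arraycopy(rawIpnBody, 0, body, prefix.length, rawIpnBody.length);
        return body;  // POST this back with the original Content-Type header
    }
}
```

The point is to never let the message pass through a String at all: bytes in, bytes out.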

Now, even if PayPal is clever enough to ignore the raw difference in the character encoding field and do all the relevant translation perfectly (ha!) that still isn't good enough. Unless there's a complete 1:1 mapping between the two character sets (and between windows-1252 and UTF-8 there isn't), there will always be 'edge cases' of particular characters that can't make the round trip. Certain names will break the system. Not normal names, but they'll be in there due to strange keyboards, or malice.

The best solution is to back away slowly from the entire mess of character encoding translation and do the following:

Set the IPN character encoding to UTF-8 in your PayPal account.

That's it. I really don't know why windows-1252 is still the default for new accounts. Pure inertia, I expect. That brings IPN up to code with the modern internet.

You'll have to check that your existing IPN Endpoints are UTF-8 compatible, of course, but if they are I suggest you do this sooner rather than later, as future-proofing. (At least test it.) And for new accounts, do it on day one, because it's much scarier changing it once transactions are flowing. One mistake and they all go tumbling away into /dev/null.

Of course IPN messages that are already in-flight (or re-sent from inside the PayPal account) continue using the old encoding, (Arrrggghhhh!) so the switch-over takes a couple of days and can't rescue failing transactions. The solution for already-sent messages is to manually validate them in each Endpoint (if you have that facility) to push the transactions through.

They're real transactions, it's just PayPal will never acknowledge them once the translation gremlins have been at work. Fortunately it's only the random names that will be a little screwy for a month, (before the next transaction corrects it) not financial data or item numbers.

When I get time, I should at least add an 'expected encoding' option to the software for those situations where you have absolutely no choice and need to force the Distributor into another character set. That might have all kinds of consequences, though.

Another logical extension to the IPN distributor would be to also intercept the validation callback from the IPN Endpoint to PayPal in order to translate the encoding back to original. (Also, to mark off that endpoint as having received the message and prevent re-sends) On close inspection this is quite a bad idea, because it removes the 'cross checking' inherent in having PayPal separately validate messages passed by the Distributor, and creates a single point of attack within the distributor/proxy. It might become necessary, but it's still a bad idea. Plus, how many IPN Endpoints allow you to change the callback URL from the "paypal.com/webscr/" default? (Some, but not all.)

It works. A base has been established. There's clearly a few extra problems to solve, but I have other things to mess with before that.

Thursday, July 21, 2011

Oh my Gods. And I thought I didn't have enough time before. There literally aren't enough hours in the day. And my attempts to create extra hours (mostly squeezed in just after midnight) backfired rather badly.

You see, we went live last week. The space shuttle launched, and then so did we. ("the last shuttle is away... aw.") And I didn't think we'd told anyone about it, but somehow we've already got two customers. Two paying members! How did they even find out?

There's two people running around inside my application and I don't know where they are. Well, I do. Sort of. There's logs. But I miss being able to just lean over their shoulder and ask "why are you doing that?". Analytics is nice, but not enough. I want screenshots.

What's terrifying is how many things are still broken. Well, I say 'broken'... I mean they work, but not all the time. From the end user perspective this means occasionally having to press 'delete' twice, or unjam a queue by restarting a server, which I'm told isn't that big a deal.

For me, however, every system jam is a failure in my logic. I'm just Vulcan enough that it personally annoys me every damn time, deep down.

My system must be a perfect system, with all bugs and errors banished to the land of wind and ghosts.

In practice, there's a point of diminishing returns. No matter how good I make my code, I'm still subject to the workings of the internet, various phone companies, and often Microsoft. So long as my error rate is lower than that lot, I'm doing well.

And there's also a point beyond which you make things worse. You have one line of code that does something, and it occasionally fails. So you write another line of code to catch the exception and perform the fallback... but that could fail too. So you add another line. Soon 90% of your code is exception handlers which almost never occur, (and then all at once!) and so are full of bugs due to lack of testing. And the module doesn't fail cleanly anymore, it always half succeeds, leading to some of the most subtle and bizarre bugs I've ever seen.

Fail fast, fail hard. These are actually the cornerstones of a reliable system, because you can always retry. Just keep pounding 'reload' until it works, and trust in Idempotency.

Google+ is good. Clean, simple, and obvious, just what you expect from Google. In fact, it's simpler to use than most of their offerings like Wave or Docs or even Blogger.

It looks a lot like Facebook, but only at first glance. Then again, Facebook looks a lot like Orkut. I expect there's only so many ways to display a list of messages from friends with comments attached.

It feels private and secure. It takes just enough effort (one click-drag) to add people into your circles that it's not something you can do accidentally. You might choose the wrong circles to post to, but that's down to you.

You can explicitly prevent 'resharing' or disable comments on a per-message basis. That's pretty much the opposite of Facebook, which uses almost everything you say as an opportunity to start a discussion with everyone you know. Disabling reshare is good for those semi-private things you don't want accidentally spread around by well-meaning friends. And disabling comments becomes important if you've got thousands of readers and controversial topics, like a news service or pop star.

The Beta state of the software is still quite apparent. Functions and comments sometimes disappear for half an hour, and then come back again with an expression of "What? I was resting my eyes! What?". The drag-and-drop of people to my circles broke at one point, but a reload fixed it. There's a couple of missing things, like sorting circles. All very minor surface stuff. But you can feel the stability of the underlying code. At no point was I confused about what was going on.

Even in Beta, it's remarkably forgiving. I was re-arranging my circles and made a big mistake: I removed people from the old circle first. Since that was the only circle they were in, and the interface only lists people currently in your circles, I lost them. (Add to the new circle first!) However, a little later, the unattached names all turned up in the 'suggested friends' list, so I unexpectedly got them back.

Mostly, the impression is that Social Networking has finally grown up a little. I expect Facebook will continue to reign amongst drunken college students who assume that every comment that falls from their self-centered lips needs to be broadcast to the entire universe as quickly as possible.

But us slightly older farts want to be able to bitch about our work to our friends. (And talk about that other job we're going for.) We need to be able to carry on secret affairs and double-lives. We don't need to make snide comments about people to their face under the guise of a general post about our day. A mature social network needs to support mature relationships, and Google+ tries very hard to do that with its 'Circles'.

In the end, I expect there's room for both. Google+ will make social networking a useful business function, while Facebook continues to hold onto their original 20-something high disposable income Farmville demographic. In fact, this separation might be good for both of them, so long as they learn to co-exist. Multiple players legitimize a market, but a war isn't good for anyone.

Friday, July 8, 2011

The IPN distributor has been in production for a couple of days now, passing actual PayPal transactions to one of the small subscription systems. This is the "run-in" period, (another machinist's term IT has appropriated) as distinct from the "smoke test" for hardware. (Literally, "plug it in and see if smoke comes out". A tense moment for anyone who has built an electronics kit.)

Basically, there's nothing to do at this point but sit back and wait for the Big Nasty Real World to break things. And I'm out of town for a couple of days, so I'm literally forced to just let the run-in version sit there and not mess with it, which is exactly what I should do.

The paper I want to talk about neatly involves quite a lot of things I care about: one of my Majors was Artificial Intelligence, I've written code for massively parallel machines (the 4096-processor MASPAR), and I built Artificial Neural Networks on my old ARM-based Archimedes A310, which was co-designed by... Steve Furber.

Steve Furber and Sophie Wilson are together responsible for ARM, perhaps the most successful processor of all time, if you count cores shipped / in use rather than dollars or profit. The Nintendo DS has two. Your smartphone may have up to five. There are more ARM processors on the planet than people. They made 2.4 billion CPUs in 2006 alone.

I doubt I'm uniquely qualified to critique this paper, but there's probably not many of us.

First, the good. The paper contains the general outlines of a computer that contains 1 million processors, capable of simulating a neural network about 1% the complexity of the human brain (a billion neurons) in real-time. And this isn't just hand-waving. The design is based on some custom ASICs (standard cores, but some specialised network interfaces) but then Steve does that kind of thing in his lunch break. ARM and Silistix are on board, and ready to make silicon for him.

I'm quite sure he could build this machine, turn it on, and it would work pretty much as advertised. And a million processors might sound like a small building's worth of rack space, but your average ARM core probably costs two dollars, and will run off an AAA battery. This is how you can pack 50,000 card-sized nodes into a couple of cabinets without melting through the floor - or blowing the budget. This is a cheap machine, relatively speaking.

Steve makes the point that your average computer costs as much in electricity as the original hardware. Since CPU power increases exponentially with every generation, but energy is getting more expensive, your power bills become the limiting factor.

However, the most impressive thing about this machine is not the processing power, but the networking. If you're simulating a billion neurons, you have the problem that each one connects to a thousand others. Every (easily simulated) neuron event generates a thousand messages that need to get to other nodes. It's not enough to have massively parallel processing - you need massively parallel connectivity too. Steve's self-healing toroidal mesh is just the ticket.

So, that's the hardware sorted. How about the software?

Well... ahem. I think the most surprising thing is how little the field of Artificial Neural Networks has progressed since I left University, nearly two decades ago. Either that, or our little group was more advanced than I realized. Reading the new paper, all the descriptions of neuronal behaviour are pretty much the same as my day. This shouldn't be too surprising, I suppose. How neurons physically work was sorted out by medicine a while ago.

And this is probably why Steve Furber's "Spiking Neural Network architecture" is basically the same as the scheme I came up with in 1992. (Ah, if only I'd published...) We seem to have gone down the same path, for the same computational reasons, only I think he's missed some things about how the biology works.

Before I get into that, I should point out that even though I had a cool idea long ago, there was absolutely no possibility at the time of putting it into practise. Steve has done the hard part, designing (and funding) a machine on which to test these theories. Finally.

And when I say "he's missed something", he's done it for the best of reasons; by approaching the problem a little too logically. Being an outstanding engineer, he has sought to remove the inefficiencies of squishy biological neurons and replace them with clean and perfect silicon.

However, it's my belief (any untested theory is just a belief) that two of the squishy inefficiencies that he's trying to remove are actually vital to the entire process: the time it takes signals to propagate along the dendrites, and the recovery time of the neuron.

I could go into a little chaos theory here, but let's just stick with the concept that, after a neuron fires, it's unavailable for a short time. And since timing is everything, this is equivalent to inhibiting the activity of the neuron for the window of the next spike. Perhaps an important one. We don't have to turn a neuron off to inhibit it (this takes a lot of time and effort) we just have to fire it prematurely. We are deliberately creating high-speed 'disappointment'.

Let's combine that with the dendrite propagation time, for some simple mind experiments.

Imagine a neuron that has its output connected to another neuron by two connections of different lengths. So when the first neuron fires, the second receives two signals, with a delay in between. If the potential is large enough to fire (spike) the second neuron, then it will probably fire when the first signal reaches it, but can't fire when the later signal reaches it, because it will still be in recovery.

Let's say it's that second firing window which is actually the important one, needed to trigger a final neuron at the perfect time. So if the first neuron is prematurely firing the second neuron (which means it's still recovering when it gets the 'proper' signal later on) then it's effectively inhibiting it from doing its job.

To let the second (wanted) signal through, all we have to do is fire the poor second neuron even earlier. An earlier spike (from yet another neuron) would inhibit the inhibition spike from the first neuron. Doing it at just the right moment means the neuron has recovered in time for the final spike.

So we can alter which of two time windows the second neuron spikes in by sending a 'control' signal a little earlier. Not exactly boolean logic.
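A toy sketch of that mind experiment, with entirely invented timings and a single refractory constant:

```java
// A neuron with a refractory period: an incoming spike fires it only if
// enough time has passed since it last fired; otherwise the spike is lost.
public class ToyNeuron {
    private final double refractory;
    private double lastFired = Double.NEGATIVE_INFINITY;

    public ToyNeuron(double refractory) { this.refractory = refractory; }

    // Returns true if the spike arriving at time t fires the neuron.
    public boolean receive(double t) {
        if (t - lastFired < refractory) return false; // still recovering
        lastFired = t;
        return true;
    }
}
```

With a refractory time of 1.0, the two copies of one spike arriving at t=1.0 and t=1.7 behave exactly as described: the early copy fires the neuron and the important later window is swallowed. Send a 'control' spike at t=0.6 first and the situation flips: t=1.0 finds the neuron recovering, and t=1.7 fires it.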

Any percussionist knows all this. Hit the drum too soon (when it's still vibrating from the previous strike) and it sounds different.

This is of course completely inefficient, although elegant in its own way. Any decent engineer wants to simplify all this inhibiting of inhibitions via multiple redundant paths. But I think it's an important part of how our brains process information, so removing that 'inefficiency' will prevent the network from behaving like its biological counterpart. All that chaotic delay might be the seat of consciousness. Who the hell knows.

But I don't see this level of temporal control in Steve Furber's paper. In fact, he seems to be actively working against it by desynchronizing the mesh, although that might just be the lowest hardware level, and perhaps 'global time' is accurately established by each node's clock. The spike packets apparently contain no time information, and seem to be broadcast to all receiver neurons instantly. That sounds like dendrite propagation is being ignored, to me.

Without very tightly modelling propagation delays, how do we recreate all the subtle phase timing of a biological brain? This is my concern. From what I've seen, nature takes advantage of every trick it can use.

Fortunately the fix is pretty simple: just tag each spike broadcast packet with a global time, and then have a delay queue on each neuron's inputs to process them in time order. If the mesh communication is fast and reliable enough, perhaps just the delay queue is enough.
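A minimal sketch of that fix, assuming spike packets can carry a timestamp (all names and types here are mine, not from the paper):

```java
import java.util.Comparator;
import java.util.PriorityQueue;

// Spike packets tagged with a global emission time; each neuron buffers
// arrivals and processes them strictly in time order, however the mesh
// happened to deliver them.
public class SpikeQueue {
    public static class Spike {
        public final double time;       // global timestamp of the spike
        public final int sourceNeuron;
        public Spike(double time, int sourceNeuron) {
            this.time = time;
            this.sourceNeuron = sourceNeuron;
        }
    }

    private final PriorityQueue<Spike> queue =
        new PriorityQueue<>(Comparator.comparingDouble(s -> s.time));

    public void deliver(Spike s) { queue.add(s); }  // arrives in network order
    public Spike next() { return queue.poll(); }    // leaves in time order
}
```

The priority queue reorders out-of-sequence arrivals for free, which is the whole trick: the mesh can be as asynchronous as it likes, so long as the timestamps survive.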

Monday, July 4, 2011

So I'm writing a PayPal module. Again. Third time. Once more unto the breach, indeed. Maybe this time I'll get it right.

See, first go I tried doing it the 'right' way. I read all the documentation, ran all the sample code, and got a sandbox account. Everything was lovely, until the day came to deal with real transactions.

The second system was born from the still-burning ashes of the first in a very phoenix-like way. (I assume most phoenixes start life confused and terrified by the fiery collapsing buildings they would tend to wake up inside...) I never want to be patching a live database's table structure before an IPN message retry deadline again.

Here's the total sum of everything I wish I'd known before starting:

PayPal Lies.

With a little more experience I realize that they are at least consistent liars, a slight mercy.

Now for some of you, the word 'Lie' may seem a little strong. Perhaps it is. Consider the following definition of the "receiver_id" IPN field from the PayPal Documentation:

receiver_id

Unique account ID of the payment recipient (i.e., the merchant). This is the same as the recipient's referral ID.

Length: 13 characters

On nearly every transaction, at least every one you see from the sandbox, this field contains our own Merchant ID, and we are encouraged to check this is indeed our own shop identifier before allowing the record through.

It's only when you finally get an "adjustment" record (extremely rare) that you see the mistake. On an adjustment, "receiver_id" is completely missing. But wait... there's our merchant code, but what's it doing over in "payer_id"?

That's when it twigs. PayPal means the receiver of the payment, not the IPN message. (Which is also called a 'receiver' endpoint.) So when PayPal reverses a payment, they also reverse the ID fields.

This might sound sensible... for about five seconds. Random ordering of the 'From' and 'To' fields based on another 'Type' field is really not a great way to build foolproof systems. Especially when one of those fields is supposed to be used as a primary fraud protection measure.
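A sketch of a fraud check that tolerates the swap, using the field names from the documentation quoted above (the account IDs in the example are invented):

```java
import java.util.Map;

// On a normal payment our account ID arrives in receiver_id; on a
// reversal the fields swap and it turns up in payer_id instead. So the
// 'is this really our shop?' check has to look in both places.
public class IpnCheck {
    public static boolean involvesOurAccount(Map<String, String> ipn, String ourMerchantId) {
        return ourMerchantId.equals(ipn.get("receiver_id"))
            || ourMerchantId.equals(ipn.get("payer_id"));
    }
}
```

Note this widens the check: a message where our ID appears in either slot passes, which is weaker than a fixed-role scheme would allow. That's the cost of the swapping design.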

A clever person might ask: "what happens if you buy something from your own shop? How does it handle the same ID in both fields?" but at least that question is easily answered: PayPal will not allow you to transact with yourself. Apparently it sends you blind.

So, when they literally say "(i.e., the merchant)" in the specification text they are badly mis-characterizing the API, leading you up a garden path. It's not like there's other documentation to cross-check this against either. At best, they're being confusing in an API intended to process millions of dollars worth of other people's money.

Here we go, I'll spend thirty seconds re-writing that definition to what it actually is:

receiver_id

Unique ID of the PayPal account that received this payment.

Length: 13 characters

And it's shorter, too. But even better would be using a design pattern that doesn't make the mistake of mixing concepts. Instead of a payer_id and a receiver_id which swap order, y'know, depending - they really should have a merchant_id and a customer_id which always have the corresponding data in them.

Even though payer_id and receiver_id are exactly the same data type, they are semantically different and their content should never appear in the other field. The consequences are just too terrifying.

I remember reading about a similar design error in the US Army's Cruise Missile Targeting Request software form which had the "Target Destination" and the "Request Source" GPS co-ordinate fields right next to each other, with predictably unfortunate results. Sometimes data schema validation goes beyond just making sure the value is the right length.

Also, which 13 characters? Numbers? ASCII? UTF-16? A spec shouldn't be this ambiguous. Any "specification" which doesn't say things specifically is, to my mind, a big fat lie in a binder.

Sunday, July 3, 2011

I made another slight improvement to my AJAX multiplexer, of an amusing sort. I made it go faster (overall) by delaying the first message.

See, it turns out that when some AJAX data is needed to fill in a control field (I'm going with a scattered widget-based scheme rather than neat standard single-form blocks) it tends to be requested in 'flurries'. A couple of controls will pop up together, (possibly as a result of another AJAX call) get their acts together, and decide they are hungry for more data.

Previously, the first data operation would immediately kick off an AJAX operation. This blocks other outgoing messages until the call comes back (otherwise we're using multiple connections, and that can cause issues elsewhere) during which time more operations queue up.

Once the first small request came back, the rest would get sent in a big batch.

By merely putting in a timer so that you schedule the AJAX request to happen soon but not immediately (10 milliseconds, in this case) you give all the other widgets enough time to 'jump on the bandwagon', and everybody else gets their data faster.
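The coalescing trick, sketched here in Java rather than the browser-side Javascript it actually lives in (the class and method names are mine; the 10 millisecond window is the one from the text):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

// The first request arms a short timer instead of firing immediately, so
// any other widgets that ask for data inside the window ride along in the
// same batch.
public class RequestBatcher {
    private final ScheduledExecutorService timer =
        Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r);
            t.setDaemon(true);
            return t;
        });
    private final Consumer<List<String>> send;    // performs the actual call
    private final List<String> pending = new ArrayList<>();
    private boolean armed = false;

    public RequestBatcher(Consumer<List<String>> send) { this.send = send; }

    public synchronized void request(String what) {
        pending.add(what);
        if (!armed) {                             // first request arms the timer
            armed = true;
            timer.schedule(this::flush, 10, TimeUnit.MILLISECONDS);
        }
    }

    private synchronized void flush() {           // timer fired: send the lot
        send.accept(new ArrayList<>(pending));
        pending.clear();
        armed = false;
    }
}
```

Three requests made within the window produce one batch of three, instead of one request followed by a batch of two.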

This has a surprising impact on the feel of many page operations. The update behavior just seems more logical. Even though the exact same amount of data is flowing to and fro, it turns out timing is very important for this, as in comedy and love.

Also, I just invented another mini-language. (Well, I'm copying Javascript, sort of) I was getting sick of iterating through the Java Maps and Lists that I get from parsing JSON, and I was really missing how easy it is to reach into a Javascript structure and pull out random items. Java's strict types are actually a bit of a pain, now that I've spent some time in prototypal script land where everything is infinitely malleable.

And that's just iterating over one level with some pretty easygoing constraints, using my convenient is_blah utility functions copied from PHP. Once you get to three or four levels of structure, the amount of code involved starts to get quite staggering and hard to read.

Wouldn't it be nice to use Javascript-like syntax to just grab nodes out of the JSON structure just like jQuery's selectors grab DOM nodes?

That sure would be nice, I thought. And the more I thought about it the nicer it seemed. So I wrote one.

It's only a few hundred lines long, and it lacks nearly every major feature of note, and in fact all it does is, given an arbitrarily complex JSON structure like [{"a":1,"b":2,"c":[1,2,3]}] and a path selector like "[0].c[2]" it picks out the expected entry from the structure. In this case, "3", the final element.

It's like running a tiny, tiny bit of interpreted Javascript code in compiled Java VM, just to pull out a single variable. I can't tell if it's silly, or pure genius.

This collapses enormous repetitive wodges of code down to tiny constant strings expanded by a library. Whether the code runs faster or much, much slower depends on a lot of factors, and isn't straightforward to guess. Several tricks to pre-cache the tokenization would help out immensely, but that's for when I get time. (No premature optimization here...)

The other thing the selection language does is to provide a kind of error checking that's sadly lacking in Javascript. Each step of the path is either an object property name (a dot, followed by text, like ".name" or ".a") or a numeric array index (a number, enclosed in square brackets, like "[0]" or "[9]"), which forces an expected type, at least between objects and arrays. Normally Javascript can use either syntax to access either object properties or array indexes, but throws random exceptions when your expectations don't match the data.
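A stripped-down sketch of such a selector over the Maps and Lists a JSON parser produces, supporting just the two token types described (this is my reconstruction, not the actual code):

```java
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Walks a parsed-JSON structure of Maps and Lists using a path selector
// like "[0].c[2]". A dot-name step expects a Map (JSON object); a
// bracketed number expects a List (JSON array) -- mismatches fail loudly.
public class JsonPath {
    private static final Pattern TOKEN =
        Pattern.compile("\\.([A-Za-z_][A-Za-z0-9_]*)|\\[(\\d+)\\]");

    public static Object select(Object root, String path) {
        Object node = root;
        Matcher m = TOKEN.matcher(path);
        int pos = 0;
        while (m.find()) {
            if (m.start() != pos)
                throw new IllegalArgumentException("bad selector at offset " + pos);
            pos = m.end();
            if (m.group(1) != null) {                   // ".name": object property
                if (!(node instanceof Map))
                    throw new IllegalArgumentException("expected an object before ." + m.group(1));
                node = ((Map<?, ?>) node).get(m.group(1));
            } else {                                    // "[0]": array index
                if (!(node instanceof List))
                    throw new IllegalArgumentException("expected an array before [" + m.group(2) + "]");
                node = ((List<?>) node).get(Integer.parseInt(m.group(2)));
            }
        }
        if (pos != path.length())
            throw new IllegalArgumentException("unparsed selector text at offset " + pos);
        return node;
    }
}
```

Given the example structure from above, "[0].c[2]" does indeed pick out the final 3, and a wrong-type step throws instead of half-succeeding.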

Totally missing features: it only selects one thing at a time, rather than lists of things, which I already want to do. And it doesn't know about the standard escape characters yet, but that's just a limitation on the selector string, not the JSON data. The Jackson library is still taking care of all that, and doing a marvelous job.

Now that I've added this handy little feature to my toolbox (and spent the last hour writing test cases to try to make it break) I'll be better prepared for the future. Maybe it will even justify the time. I live in hope.

Friday, July 1, 2011

Later on I'll be talking about The Problem, the formal definition of what we try to solve. But for now, let's look at just a small part of it; PayPal's amazing relationship with independent software developers.

Most people on the internet know about PayPal. It's ubiquitous. I'm writing a new PayPal 'backend', intended to manage user accounts. This is not my first time on this particular merry-go-round, so I'm working with a large base of tested and reliable code, but PayPal is special.

Two months ago, PayPal added a new field to its IPN messages, called "ipn_track_id". It's filled with a big 22-character-long unique-looking alphanumeric hash.

Hopefully this is the long-sought-after "idempotency marker", much desired by PayPal module developers in the same way the Arthurian Knights were quite keen on the Holy Grail.

'Idempotency' means 'doing it twice doesn't screw things up'. It has become a very important concept in making mixed internet systems reliable. See, there are 'edge cases' on days when both the network and users are being silly that cause issues that lead directly into the Twilight Zone...

Consider, if you will, the following user story. It is a day like any other, when Arthur Q Public makes a PayPal transaction for a simple product item on an internet web site. Then, while intending to cancel another order, he accidentally reverses the one he wanted to keep. PayPal begins the reversal process, locking Arthur out of further changes. Arthur, upon realizing what he has done, re-purchases the item he really wanted, expecting everything will get sorted somehow. Little does he realize that the IPN message system is passing on his purchases in random order, plus retries, on a day when the PayPal validation callback service is dropping connections from a DDoS attack, or traffic from Random Pop Star's latest album, which from a systems point of view looks like the same thing.

Depending on how your system answers the question 'Is this record the same as the other one?', the purchase will be correct (processed, reversed, processed again), punitive (purchased, purchased again but ignored, reversed; leaving none), doubled (purchased, purchased again, reversal doesn't match the latest; leaving two purchases), or worse (multiple interleaved retries and reversals, leaving god knows what).

Payment records are fairly safe, because their Transaction ID works as the idempotency id. If you get the transaction twice, you can safely ignore the second. The Subscription ID works the same on the signup record... and that's the first problem right there, because suddenly you need to use different IDs depending on the type of the record.

Is there a 'Record Type' field? Sure there is! At least three of them. (Doh.)

Oh, and not all records have an ID that makes them idempotent, especially things like reversals and cancellations.
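One possible shape for the resulting dedupe logic, with the field-to-key mapping being exactly the sort of guesswork described (txn_id, subscr_id, and txn_type are standard IPN field names; the key scheme is invented):

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Pick a dedupe key depending on what kind of record arrived, then keep
// a set of keys already processed. Which field is authoritative for
// which record type is the magic-combination guesswork complained about.
public class IpnDeduper {
    private final Set<String> seen = new HashSet<>();

    public boolean isNew(Map<String, String> ipn) {
        String key;
        if (ipn.containsKey("txn_id")) {
            // Payments: the Transaction ID is safe on its own.
            key = "txn:" + ipn.get("txn_id");
        } else if (ipn.containsKey("subscr_id")) {
            // Subscription events: need the record type too.
            key = "sub:" + ipn.get("subscr_id") + ":" + ipn.get("txn_type");
        } else {
            return true;  // no usable ID: can't dedupe, let it through
        }
        return seen.add(key);  // false if we've already processed this one
    }
}
```

A single consistent idempotency field would collapse all of this down to one line, which is why "ipn_track_id" looked so promising.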

So the sudden appearance of what looks like a consistent idempotency ID is exciting news. It significantly changes the processing logic of an IPN receiver. And remember this new field is being sent by PayPal as part of every single IPN message, it's not like it's an optional request. Everyone who writes or just runs an IPN endpoint (ie: an Internet Business) is expected by PayPal to deal with the existence of this new field in the IPN messages, and deal with the consequences.

You'd expect it to be documented, right?

Right? Suddenly pushing a new field at every single piece of IPN software out there, you'd expect a little Change Management documentation? Here's the highest ranked exchange in PayPal's own developer forums: https://www.x.com/thread/51579

While we wait for the documentation to catch up, could you please confirm that this field will be a max of 22 characters?

... and that's it. It's been three months.

There's a similar story attached to the "notify_version" field suddenly changing to "3.1" from "3.0", and "internal tracking purposes" was mentioned as part of the reason for that one too.

Remember that the IPN protocol is similar to email; we all run receiver software expecting messages to conform to a particular format. Database tables are created and maintained with columns for each field. (well, mine don't. I know PayPal too well.) Transaction field values are thoroughly checked to prevent fraud and abuse, and deviations from the standard (such as the version changing from "3.0" to "3.1") can literally set off alarm bell sound files. You would expect PayPal to give the developer network a bit of a heads-up before arbitrarily changing the specification, especially because getting these new fields is not optional. There is no way to turn them on or off.

My personal theory is there is a tier of 'paid-up PayPal developers' who are given access to this advance information, with legal non-disclosure requirements that are apparently well enforced, because it's hard to find any information in the usual developer places. This would give them a reason to be hostile towards the Independent Developers, in order to give the paid-up members an advantage for their commercial software.

The alternative is that PayPal simply doesn't care about documenting IPN, and treats all developers like this. Basically, a choice between malevolent or uncaring. Tricky.

I really want to use the 'ipn_track_id' field... If it is what I think it is, it solves a lot of problems. But like everyone else, I apparently have to reverse-engineer its meaning from the messages we are given.

I have to guess? While writing software expected to process actual money transactions? Really?

Fortunately, I have a large collection of data to look at. Not sandbox transactions, but a year of payments and edge cases like reversals and dispute adjustments. Data that, by definition, I can't share or use as examples. And I can tell you that PayPal's own 'sample transactions' don't even come close to the complexity of what happens during a dispute.

I hope this gives you a tiny insight into why PayPal site integration is so variable, and why so many smaller shops seem to have such difficulty with automatic processing, and go for larger 'checkout' services who take yet another cut on top of PayPal fees, and the original Bank charges, but don't allow for 'deep' user management. Small systems and independent coding projects (like the Drupal module) are fragile in the face of PayPal's arbitrary protocol changes, apparently by design.

Update:
Nope. A little data trawling has suggested that "ipn_track_id" has an entirely different meaning, and solves a different problem. This value is repeated across two IPN messages that were otherwise hard to match together: the signup and payment transactions that are generated when the client starts a 'subscription' recurring payment.

Yes, it does also match across retries for the same record, but this seems secondary in purpose. It might technically be idempotent when combined with the "txn_type" field, but at that point we're back to magic combinations of fields again.

Also I just want to mention a minor thing - if you happen to be designing an API like PayPal's: while having a field like "reattempt" might sound like a good idea, it means that each retry data request is subtly different from the original. So otherwise simple and reliable methods like storing message hashes won't work unless you strip those (otherwise quite useless) fields. A shame, more than a problem. It means that we need one set of logic for spurious TCP/HTTP retries, (which can happen, especially if a proxy is involved) and a second set for when PayPal reattempts explicitly, despite it making no difference to us why the glitch happens.
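A sketch of the message-hash workaround with the volatile field stripped first (the "reattempt" field name is from the text; the canonicalization scheme is mine):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Map;
import java.util.TreeMap;

// Hash an IPN message for duplicate detection, first stripping the fields
// PayPal mutates between retries (like "reattempt") so a resend hashes
// the same as the original.
public class IpnHasher {
    public static String hash(Map<String, String> ipn) {
        try {
            Map<String, String> stable = new TreeMap<>(ipn);  // canonical key order
            stable.remove("reattempt");
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            for (Map.Entry<String, String> e : stable.entrySet()) {
                md.update((e.getKey() + "=" + e.getValue() + "\n")
                        .getBytes(StandardCharsets.UTF_8));
            }
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest()) hex.append(String.format("%02x", b));
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);  // SHA-256 is always available
        }
    }
}
```

Two copies of a message that differ only in the stripped field hash identically, while any real change to the transaction data still produces a different hash.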

But it fits with PayPal's egocentric worldview. We are apparently supposed to care how many times PayPal tried to give it to us, enough to justify them altering the transaction message. But we don't get a simple field which lets us know which message the retry is for, in case we already got it.