This is my personal blog about programming. In my 20-odd years I have found that programming is mainly problem solving. In my earlier days I was handed the Pure Mathematics book by Hardy and found what a gem it is. In my opinion it achieves clarity, chronology, and precision at the highest level. That inspires me a lot, and I try to follow it in my professional career. That is the only reason I gave this blog its objective - Pure programming!

It's parsing, and not much different from what I said in the title. Yes, it is the wild wild west. As you go about your day-to-day business, no matter how hard you try not to think about it, you eventually end up with stuff that feels either at your fingertips or far out of reach!

Parsing is one such thing when you program. I've seen many different mistakes, and it is not so intuitive for a lot of people (including me). So what are parsing, pattern matching, and so on?

A bit of history first, in case you don't happen to know it. Long, long ago, computers were built to do one specific type of computation or another. There were computers for scientific computation, and others for massive data processing and business computation. A generalization effort brought them together into universal (sort of) computers on which you can do many different types of computation. Today this is simply taken for granted.

Now, for tackling interesting and somewhat big problems, there were already a few languages suited to specific uses, so that one did not have to work in Assembler, which is almost machine-level language. For example, Fortran for scientific computation and Cobol for business computation. Others came, did their service, and went away. Since some of these languages were meant to solve specific types of problems, they were a bit awkward for other types. Moreover, the I/O systems were designed around block-level interfaces (especially to optimize system performance), and those interfaces were directly exposed to programs.

A major shift was the idea that I/O should be represented as a stream of something, leaving the program to interpret it the way it wants. The most successful result of this effort is the birth of the C language, where the stream of something was a stream of characters, and the character encoding was ASCII. The computer itself needs bits and bytes: not characters, not words, not formulas... nothing else.

So, given a character stream, what should I interpret it as, and how? This is where parsing comes into play. Basically it means that I, the program, want to interpret this stream in a particular way, because I know I can get meaningful information out of it.

No wonder it is like Pershing County, Nevada, around 1850! Wild wild west, everyone has her own rules and interpretation!

Parsing, to me, is still a very elegant art, the Art of Programming. It is mostly used on streams to validate and interpret computer language statements, and it has a rich foundation in computer science. Scanning a stream, giving it a meaningful representation, and then parsing it is the heart of computer languages and lots of other things. To appreciate its beauty, and its non-intuitive nature, just try to code up a program that parses a stream of text and finds out whether it contains a valid C program statement. To be more specific -

int main(){;;;;; return 0);} - Is it a valid C program? Now insert a dot ( . ) somewhere and see what your parser says. Very soon we will see that if-then-else, switch, etc. are going to make the program a living mess. And we will have two questions to answer: (1) Is it interpreting correctly? (2) Is it efficient?

So there has to be a science behind it too! If not, then "why not?" would be the question for inquiring minds.

One of the old tools of the trade is the ring buffer, crafted to be used for quite a few things! If you, the reader, have not heard of it yet, you will hear of it soon!

So what is it? And what is the use?

Well, sometimes you just want to know the recent history of certain things, say kernel events, or the recent activities of a server or even your own application. So you want to constantly pump trace messages into a buffer, but there may be restrictions on how much memory you can allocate for it. Optionally, you might have a backup procedure that writes the buffer to a file intermittently, so that if you lose the buffer contents you can at least expect the file to be there to take a peek at.

While this can be used in a variety of situations, one thing to understand is that the paradigm I try to follow is "Coding for debugging". The idea is that anyone can have an infrastructure (more on this later) to drop into their software and use for debugging and other stuff.

When I develop some code, one thing I try to keep in my head is to make it reusable. The other thing I ask is whether I can try it in user-mode code before bringing it into the kernel. And finally, can I have some way to introduce debuggability into it, from the primitive printf and its cousins to elegant use of language features?

In this example the ring size is fixed in terms of how many elements it can hold. For dynamic sizing it would have to be changed. There are situations where, depending on how much memory is available in the computer, the ring size has to be self-tuned, which necessitates a dynamically sized ring. Another missing part is intermittent flushing to a file, and yet another is how to use language features to debug this infrastructure.

I will not go into dynamic sizing here; it could be another note, based on request(s).

For intermittent flushing to a file, one approach is to run another thread that acts as a consumer and drains the ring buffer to a file. That is not very difficult to do either.

As for the language features, why do we need them? Well, the message producer could be an arbitrary source, the ring buffer would be an intermediary (the way a cache works), and the consumer would be another thread. So one question would be: how can I bypass the ring and check that the producer is producing, with the consumer taking messages directly from the producer and writing them to the file? An elegant solution is to have virtual functions that can be redirected on the fly using late (dynamic) binding. Then it takes only a one-line change to direct the messages to the file instead of the ring.

Succinctly, the producer could produce nothing at all, or produce to the ring, or produce directly to the final destination (the file), or produce to both, with or without any assistance from the consumer of the ring. If we cover all these cases, it becomes easy to find the problem component of the infrastructure.
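The four cases can be sketched with one pure virtual function and a handful of small sinks. This is only an illustration under my own assumptions: all the names (MessageSink, NullSink, RingSink, StreamSink, TeeSink, Producer) are hypothetical, the "ring" here is a bounded deque rather than a real ring, and the "file" is any std::ostream.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <ostream>
#include <sstream>
#include <string>

// One pure virtual entry point; the producer never knows what is behind it.
struct MessageSink {
    virtual ~MessageSink() = default;
    virtual void write(const std::string& msg) = 0;
};

// Case 1: produce nothing at all.
struct NullSink : MessageSink {
    void write(const std::string&) override {}
};

// Case 2: produce to the ring (a bounded deque stands in for it here).
struct RingSink : MessageSink {
    explicit RingSink(std::size_t cap) : capacity(cap) { assert(cap > 0); }
    void write(const std::string& msg) override {
        if (recent.size() == capacity) recent.pop_front();  // drop the oldest
        recent.push_back(msg);
    }
    std::size_t capacity;
    std::deque<std::string> recent;
};

// Case 3: produce directly to the final destination.
struct StreamSink : MessageSink {
    explicit StreamSink(std::ostream& o) : out(o) {}
    void write(const std::string& msg) override { out << msg << '\n'; }
    std::ostream& out;
};

// Case 4: produce to both.
struct TeeSink : MessageSink {
    TeeSink(MessageSink& a, MessageSink& b) : first(a), second(b) {}
    void write(const std::string& msg) override { first.write(msg); second.write(msg); }
    MessageSink& first;
    MessageSink& second;
};

// The producer holds a pointer, so redirecting is a one-line change.
struct Producer {
    explicit Producer(MessageSink& s) : sink(&s) {}
    void redirect(MessageSink& s) { sink = &s; }
    void emit(const std::string& msg) { sink->write(msg); }
    MessageSink* sink;
};
```

With a Producer p built on a RingSink, a single call such as p.redirect(direct) sends every later message straight to the stream, and p.redirect(both) feeds ring and stream together, so each component can be taken out of the picture in turn.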

The base code for such a ring is given as an example, but a full implementation handling dynamic binding and debugging, including producing in LIFO or FIFO order, is intentionally left out for now. A C++ class based on pure virtual functions and threading works just fine for me.
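For readers who want something to start from, here is a minimal sketch of a fixed-size trace ring. It is not the original listing: the class name TraceRing and its push and snapshot methods are assumptions of mine, the ring overwrites the oldest entry when full, and a plain mutex stands in for whatever threading scheme the full infrastructure would use.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <string>
#include <vector>

// Fixed-size ring of trace messages. When the ring is full, the
// newest message overwrites the oldest one.
class TraceRing {
public:
    explicit TraceRing(std::size_t capacity) : slots_(capacity) {
        assert(capacity > 0);
    }

    // Producer side: store one message, safe to call from any thread.
    void push(const std::string& msg) {
        std::lock_guard<std::mutex> lock(mu_);
        slots_[head_] = msg;
        head_ = (head_ + 1) % slots_.size();
        if (count_ < slots_.size()) ++count_;
    }

    // Consumer side: copy out the retained messages, oldest first.
    std::vector<std::string> snapshot() const {
        std::lock_guard<std::mutex> lock(mu_);
        std::vector<std::string> out;
        out.reserve(count_);
        std::size_t start = (head_ + slots_.size() - count_) % slots_.size();
        for (std::size_t i = 0; i < count_; ++i) {
            out.push_back(slots_[(start + i) % slots_.size()]);
        }
        return out;
    }

private:
    mutable std::mutex mu_;
    std::vector<std::string> slots_;
    std::size_t head_ = 0;   // next slot to write
    std::size_t count_ = 0;  // number of valid slots
};
```

A ring built as TraceRing ring(3) that receives five messages keeps only the last three, and snapshot() hands them back oldest first, which is what a flushing thread would write to the file.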

I do know how important proper communication is, as I said earlier. Most people learn to communicate persuasively and intelligently in order to achieve their objectives. But in computer communication, between communicating processes, there is hardly any room for being persuasive; or rather, persuasive means correct.

In my previous note I mentioned how shared memory can be used for interprocess communication, but then all the synchronization needed to achieve correct communication is up to you. That work can be delegated to the underlying system by using other methods. In Windows, one such method is the named pipe. It is a FIFO, and a bit more.

The objective of my experiment was to have a send and receive loop between a server and multiple applications (clients) that communicate with it. Essentially, the server has a dedicated pair of channels between each specific client and itself. So if 10 clients try to communicate with the server, there will be 10 pairs of IPC channels. Each pair consists of two unidirectional channels, one in each direction. Choosing a pair of unidirectional channels instead of one bidirectional channel simplifies implementation and debugging. By nature, an application gets a handle to an opened instance from the system, and that's it. To make a sane channel, you and I have to wrap this handle as an element of an abstract channel. The abstract channel can carry lots of additional information, like state, messages received or sent, last used time, and so on. The channels really form a cross-bar switch: the read end of one is the write end of its partner, and vice versa.
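A rough sketch of what such an abstract channel might look like, seen from the server side. Everything here is hypothetical naming on my part (Channel, ChannelPair, ChannelState), and the void* member is only a stand-in for the handle the system gives back; no real pipe is opened.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

enum class ChannelState { Closed, Connected, Broken };

// One unidirectional channel: the raw handle plus bookkeeping
// that helps later with debugging and performance questions.
struct Channel {
    void* handle = nullptr;  // stand-in for the OS handle
    ChannelState state = ChannelState::Closed;
    std::uint64_t messages = 0;  // messages moved through this end
    std::chrono::steady_clock::time_point last_used{};

    void touch() {
        ++messages;
        last_used = std::chrono::steady_clock::now();
    }
};

// The server keeps one pair per client. What the server writes on
// to_client is what the client reads, and vice versa for to_server.
struct ChannelPair {
    explicit ChannelPair(int id) : client_id(id) {}

    void open() {
        to_client.state = ChannelState::Connected;
        to_server.state = ChannelState::Connected;
    }
    void server_sent() {
        assert(to_client.state == ChannelState::Connected);
        to_client.touch();
    }
    void server_received() {
        assert(to_server.state == ChannelState::Connected);
        to_server.touch();
    }

    int client_id;
    Channel to_client;  // server writes, client reads
    Channel to_server;  // client writes, server reads
};
```

With 10 clients the server would simply hold 10 such ChannelPair objects, and the per-channel counters and timestamps are what you look at first when a conversation stalls.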

I started out with such an abstract channel just because I knew I would need it for debugging and performance work. One problem, and it is always the problem when communicating processes have to sustain a conversation, is THE DEADLOCK. It mainly comes from failing to foresee flaws in one's assumptions.

My assumption was that if (1) send and receive are blocking calls, (2) they start in alternating (i.e. ping-pong) fashion, and (3) the IPC channel is flawless, then I don't need any synchronization. Theoretically this is true. We can argue it by stepping through the scenarios that satisfy the above assumptions and showing that no synchronization is needed.

Now, since I implemented the IPC myself, how could I prove that the channels are somewhat (if not totally) flawless? This is the reason I started out with the abstract channel. But assumption (1) was not true. Blocking here is from the local system's point of view; it is not blocking or synchronous with respect to the other end. The call blocks only until the local end-point knows how to satisfy it, where the end-point is the end-point of the channel instance.

The trouble is that if you just rely on the above assumptions and try out IPC as stated, it would work. In my case I was testing some of my own software, which, depending on the load of the system, does not necessarily hand the message to the local end of the IPC channel right away. That makes the send side essentially asynchronous. A correct implementation would have a pending queue with an associated aging algorithm to take messages off the queue and send them; then we would not see the deadlock in ping-pong style communication. But finding this out was a side effect, in a good way, of chasing down why it was deadlocking.
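One way such a pending queue with aging could look is sketched below. This is my guess at the shape of the idea, not the software under test: PendingQueue, enqueue and drain are hypothetical names, time is a plain tick counter supplied by the caller, and "busy" stands for whatever load signal the sender uses to hold messages back.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <functional>
#include <string>
#include <utility>

// Messages wait here while the sender is busy, but never longer than
// max_age ticks: once a message has aged out it is sent regardless of load.
class PendingQueue {
public:
    explicit PendingQueue(std::uint64_t max_age) : max_age_(max_age) {}

    void enqueue(std::string msg, std::uint64_t now) {
        // Ticks must not go backwards, or the age arithmetic below breaks.
        assert(q_.empty() || now >= q_.back().enqueued);
        q_.push_back(Entry{std::move(msg), now});
    }

    // Sends in FIFO order. When not busy, everything goes out; when busy,
    // only messages that have waited at least max_age ticks. Returns the
    // number of messages handed to send.
    std::size_t drain(std::uint64_t now, bool busy,
                      const std::function<void(const std::string&)>& send) {
        std::size_t sent = 0;
        while (!q_.empty() &&
               (!busy || now - q_.front().enqueued >= max_age_)) {
            send(q_.front().text);
            q_.pop_front();
            ++sent;
        }
        return sent;
    }

    std::size_t size() const { return q_.size(); }

private:
    struct Entry {
        std::string text;
        std::uint64_t enqueued;  // tick at which the message was queued
    };
    std::uint64_t max_age_;
    std::deque<Entry> q_;
};
```

The point of the age limit is that the peer blocked in its receive is guaranteed to get its message within a bounded time, so a ping-pong exchange cannot hang forever just because the local side is loaded.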

The net result of this note is that as long as the assumptions (the persuasiveness) are clearly understood, communication is fun; otherwise it is a big challenge to find the flaws in the assumption(s).

Really, I'm a strong believer in communication, in every form: written, spoken, signs and signals. It is really the input to every facet of learning. And who does not want to learn?

Communicating processes means two or more processes communicating among themselves. The simplest example is two processes communicating with each other. The medium is, of course, an information bus, and this bus could be almost anything. Processes talk over wireless and wired networks, and over physical media other than what we call a network today. The most fundamental aspect of communication is signal processing, which is in the physics and engineering domain. But our topic here is digital communication, particularly on Windows systems.

Usually, to achieve good / effective / reliable communication, the following steps are important -

Get a way to programmatically talk: transmit junk back and forth to see that the raw bus is active.

Devise protocol(s) to have good / effective / reliable communication.

Incrementally implement.

Test & Debug.

To get a quick (well, fairly quick, I would say!) implementation, I took the simplest approach first: the shared memory technique in user-level programming. In a hurry I will make a few bad pointer references, and staying at user level saves a lot of pain and agony, if you understand what I mean. On Windows, shared memory is done through the file mapping paradigm. So the information bus I selected is shared memory.

To keep things simple, I picked a pair of such information buses, one in each direction. Here we have a choice: either one end tells the other which bus will be used in which direction, or we make an a priori assumption about which bus takes which direction. For shared memory I just made the a priori assumption by giving explicit names, like client shared memory and server shared memory. The naming is purely based on who uses the bus for writing. So the server shared memory information bus is for the server to write and the client to read. Things are simple. Why? Because by the time the client or the server starts, everything about the bus is already in place. They just need to communicate blindly with each other, the way we talk without ever thinking of the atmosphere (particularly the ether) that acts as a bus.

Once the bus is in place and in raw mode, we can talk, no matter how nonsensical the talk is :). This is to see that information is getting exchanged, albeit with both sides learning little, if anything, since it is like talking rubbish, with no communication at all.

Now the next step is to carry the payload across with meaningful context. Remember the older days, when we had very noisy telecom lines on the long-distance trunk system, and we used to lose context and kept asking "what? what? please repeat".

So this meaningful context is really the protocol. In this particular protocol we had the following -

1) A payload would be processed only once by the destination since that payload was conveyed only once.

2) The payload with associated protocol information is called message. Each message is atomically processed.

3) Every message has to be processed.

Now, since a message is processed only once, consuming it means it is gone from the pipe once processed. In our case the reader reads atomically and, if it is indeed a message from the writer, atomically erases it. The writer, on its side, makes sure that the pipe/bus holds no message (it is a single-message channel) and then atomically writes the whole message.
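The write-if-empty, read-then-erase rule can be sketched with a single slot guarded by one atomic flag. This is a simplified in-process stand-in, assuming exactly one writer and one reader per bus; in the real setup the slot would live inside the mapped shared memory. MessageSlot, try_write and try_read are names I made up for the illustration.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

// A single-message channel: the flag says whether a message is present.
// Intended for one writer and one reader only.
struct MessageSlot {
    static constexpr std::size_t kMaxBytes = 256;

    std::atomic<bool> full{false};
    std::size_t length = 0;
    char bytes[kMaxBytes];

    // Writer: refuse if the previous message has not been consumed,
    // otherwise publish the whole message in one step.
    bool try_write(const char* data, std::size_t n) {
        assert(n <= kMaxBytes);
        if (full.load(std::memory_order_acquire)) return false;
        std::memcpy(bytes, data, n);
        length = n;
        full.store(true, std::memory_order_release);  // message becomes visible
        return true;
    }

    // Reader: take the message if there is one, then erase it by
    // clearing the flag, so it can never be processed twice.
    bool try_read(std::string& out) {
        if (!full.load(std::memory_order_acquire)) return false;
        out.assign(bytes, length);
        full.store(false, std::memory_order_release);  // slot is empty again
        return true;
    }
};
```

A second try_write before the reader has consumed the first message returns false, which is the "make sure the bus has no message" rule; the caller then retries later, so every message is eventually processed exactly once.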

The payload is wrapped with header (hdr) information that indicates, say, client id, sequence number, message length, etc. The payload itself sits within the total message.
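As a small illustration of the wrapping, here is one possible header layout with pack and unpack helpers. The field names and widths are assumptions of mine, not the actual protocol, and the header is copied in host byte order, which is fine only because both ends sit on the same machine.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical header: who sent it, which message it is, how long the payload is.
struct MessageHeader {
    std::uint32_t client_id;
    std::uint32_t sequence;
    std::uint32_t payload_length;
};

// Message = header followed by payload, as one contiguous byte block.
std::vector<unsigned char> pack(std::uint32_t client_id, std::uint32_t sequence,
                                const std::string& payload) {
    MessageHeader hdr{client_id, sequence,
                      static_cast<std::uint32_t>(payload.size())};
    std::vector<unsigned char> bytes(sizeof hdr + payload.size());
    std::memcpy(bytes.data(), &hdr, sizeof hdr);
    if (!payload.empty()) {
        std::memcpy(bytes.data() + sizeof hdr, payload.data(), payload.size());
    }
    return bytes;
}

// Returns false when the block is too short or the length field
// disagrees with the number of bytes actually present.
bool unpack(const std::vector<unsigned char>& bytes, MessageHeader& hdr,
            std::string& payload) {
    if (bytes.size() < sizeof hdr) return false;
    std::memcpy(&hdr, bytes.data(), sizeof hdr);
    if (bytes.size() - sizeof hdr != hdr.payload_length) return false;
    payload.assign(reinterpret_cast<const char*>(bytes.data()) + sizeof hdr,
                   hdr.payload_length);
    return true;
}
```

The length check in unpack is what lets the reader decide whether it is "indeed a message from the writer" rather than leftover junk from the raw-bus stage.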

Following these simple rules, it is fairly easy to cook up a baseline information bus for communicating processes.

Significance is subjective, hence fairly easy to answer! It just depends on what you are dealing with and what the impacts are. So it could be anything from nothing to enormous; it all depends on the effect of these on us!

I was reading a book, "Exploring Randomness" by Gregory J. Chaitin, that I bought a few years back. If you read it and understand the concept behind it, you will surely appreciate the vastness of this... But to touch on some areas...

-- Dialects: At machine level, different compilers generate different code.

-- Their Idols: Many since they love to emulate their idols

-- Capability: From simply killing a program, to making a system unstable, to taking down the internet, etc.

-- Stealthiness: Again, from totally visible to totally stealthy.

There are other traits that anyone can find online. But imagine one thing: given a binary file, it is sometimes essential to find out which compiler generated it, and many disassemblers might be needed to capture the essence when analyzing such an infected module.

Now if I just take the dialects of C, there are at least three different compilers (and more if you try out open source): Watcom, Borland, and Microsoft. Then there are quite a few calling conventions: _cdecl, Pascal, _stdcall, _fastcall. For C++ there is also, as an example, the register convention for passing the THIS pointer. ONE QUESTION WOULD BE: GIVEN A BINARY, how do I disassemble it correctly and uniformly? YET ANOTHER QUESTION IS: how do I know what language and what dialect was used?

An observation: just a couple of months ago I was debugging a BugCheck in the Windows kernel, and it said the image was corrupted. Two questions: What got corrupted? How did it get corrupted? By dumping the nearby disassembly, it was clear that code was corrupted. Well, at least the WinDbg disassembly was showing the corrupted places. Note that in the past I played around with disassembly a bit to see how it behaves if I present it with arbitrary addresses in the .text segment. Let's say foo() is a function and I say u foo, where foo is the starting address, and then try foo+1, foo+2, etc. You will see that the disassembler blissfully carries on and tries to interpret whatever you give it. NOW YOU SEE how easy it is to get a corrupted code segment, and the result is known.
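The foo versus foo+1 effect is easy to reproduce with a toy variable-length instruction set. To be clear, the three opcodes below are invented for the illustration and have nothing to do with x86 or with how WinDbg decodes; the only point is that a decoder started one byte off still produces a plausible-looking listing.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Toy encoding (made up): 0x01 = nop (1 byte), 0x02 = load imm8 (2 bytes),
// 0x03 = jmp imm16 (3 bytes). Any other byte is listed as raw data, "db".
std::vector<std::string> disassemble(const std::vector<unsigned char>& code,
                                     std::size_t start) {
    assert(start <= code.size());
    std::vector<std::string> listing;
    std::size_t pc = start;
    while (pc < code.size()) {
        std::string name = "db";
        std::size_t len = 1;
        switch (code[pc]) {
            case 0x01: name = "nop";  len = 1; break;
            case 0x02: name = "load"; len = 2; break;
            case 0x03: name = "jmp";  len = 3; break;
            default: break;
        }
        if (pc + len > code.size()) {  // operand runs off the end
            listing.push_back("(truncated)");
            break;
        }
        listing.push_back(name);
        pc += len;  // the decoder trusts the length and moves on
    }
    return listing;
}
```

For the bytes 03 02 01 02 01 01, starting at offset 0 gives jmp, load, nop, while starting at offset 1 gives load, load, nop: what was an operand is happily read as an opcode, and nothing in the output tells you which listing is the real one.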

But when the .text is properly aligned and not corrupted, we always want to be able to say - Ah, this is the dialect of that language being used here. THAT WOULD LEAD US to build some mental makeup for looking for known patterns, and then to try to analyze the rest...

A cautious reader might have one question: why did I mention the book "Exploring Randomness"? What, if anything, has it got to do with this discussion? Well, if you are really serious 'bout it, I would recommend yet another book - Malicious Cryptography by Dr. Adam L. Young & Dr. Moti Yung. I'm sure it will enlighten a lot of readers about the depth of this topic!