The data product "TAQ NYSE Order Imbalances" from the New York Stock Exchange is in a format that is described pretty well in sections 4.8, 4.9, 4.10, and 5 of the document "NYSE Order Imbalances Client Specification", version 1.12, q.v. Briefly, it's a mix of ASCII and binary: stock symbols, for example, are in plain text but padded with null bytes, while numerical fields are stored as raw binary, i.e. as the bytes whose values encode the number. All fields are fixed-width, so data rows simply follow one another.

(From the FAQ list: "We can answer questions for software packages that are specific to quantitative finance". Just in case anyone's thinking of closing this as off-topic. (But if you think it'd have a better chance of being answered elsewhere, please let me know!))
–
msh210, Feb 4 '13 at 20:37

This is the multicast data, i.e. the live feed. This is not the historical data.
–
chrisaycock, Feb 4 '13 at 20:59

@chrisaycock, yes, but the format is similar and the live data's specification explains it better.
–
msh210, Feb 4 '13 at 21:00

I assume then that you have a file that's in this format. In that case, you'll need to read the message header to determine the message type, and then interpret the bytes that follow as an unpadded struct. Because this is a compact binary file, you'll likely need to write your decoder in C, or at least in some language that can specify memory layout and field size.
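To illustrate what "interpret as an unpadded struct" means without C, here is a minimal Python sketch using the standard `struct` module. The layout below (a 2-byte message type, an 11-byte symbol, a 4-byte quantity) and the big-endian byte order are hypothetical, chosen only for illustration; the real field order, widths, and byte order have to come from the specification.

```python
import struct

# HYPOTHETICAL layout, not the real NYSE one:
#   2-byte unsigned message type, 11-byte null-padded symbol, 4-byte unsigned quantity.
# The leading ">" selects big-endian with no alignment padding between fields,
# which is what "unpadded struct" refers to. Byte order is an assumption.
LAYOUT = struct.Struct(">H11sI")

def decode(record: bytes) -> dict:
    """Split one fixed-width record into named fields."""
    msg_type, symbol, quantity = LAYOUT.unpack(record)
    return {
        "msg_type": msg_type,
        # Strip the trailing null padding before decoding the text field.
        "symbol": symbol.rstrip(b"\x00").decode("ascii"),
        "quantity": quantity,
    }

# Build a made-up record so the sketch can be run without any data file.
sample = LAYOUT.pack(240, b"IBM", 1500)
print(LAYOUT.size, decode(sample))
```

With this hypothetical layout each record is 17 bytes; changing the format string is all it takes to match a different field list.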
–
chrisaycock, Feb 4 '13 at 21:02

@chrisaycock (1) Part of your latest comment ("interpret... struct") is Greek to me. If it's still relevant after you read the rest of this comment, then can you explain, please? (2) I am not a programmer. In particular, I don't know C. (3) As to the first part of your comment, do you mean the first few bytes of the file? (a) I don't think the data file uses that sort of marker. I think it strictly follows the format specified in the linked-to documentation and summarized in my question. (b) I don't know how to read those bytes.
–
msh210, Feb 4 '13 at 21:06

2 Answers

I think you do not need to be a "Systems Programmer", certainly not an experienced one, to solve this problem:

1) Focus on the header; it's your legend to the file structure. It describes the format and essentially already tells you how to decode the messages that follow.

2) Depending on your choice of language, you then process each message in binary format and convert each item to a numeric type. In C#, some use "BitConverter", but obviously C# is not the language of choice here. If you can tell me which specific language you use to make the conversion, that would be helpful. A lot of people use Python to convert this kind of data to a higher-level, text-based format such as CSV or any other delimiter-separated structure.

3) Before you convert, consider whether you want to perform operations directly on the byte-array representations of your numeric values. (I am not familiar with the specific feed you mention, but some feeds only output the "alpha" rather than the full spread, for example, so you need to add or subtract values, and in certain cases that can be more efficient on the byte array itself.) Here is an example:
http://stackoverflow.com/questions/3641274/c-sharp-int-byte-conversion

Here are a couple of Python examples just to show you how a simple byte[] -> int conversion could be done:
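A minimal sketch of such a conversion, assuming Python 3. The sample bytes are made up, and the big-endian byte order and 4-byte width are assumptions to check against the feed's specification (try "little" or "<I" if the values look wrong).

```python
import struct

# Four made-up bytes standing in for one 32-bit numeric field.
raw = bytes([0x00, 0x00, 0x01, 0x2C])

# Option 1: int.from_bytes. The byte order here is an assumption.
value = int.from_bytes(raw, byteorder="big", signed=False)

# Option 2: struct.unpack. ">I" means big-endian unsigned 32-bit integer.
(value_struct,) = struct.unpack(">I", raw)

# A text field padded with null bytes: strip the padding, then decode as ASCII.
symbol_raw = b"IBM\x00\x00\x00\x00\x00"
symbol = symbol_raw.rstrip(b"\x00").decode("ascii")

print(value, value_struct, symbol)  # 300 300 IBM
```

Both options give the same number; `struct` becomes more convenient once several fields have to be read from one record in a single call.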

P.S.: It won't help you, but I find mixed message formats very inefficient (not that this is your fault). The most efficient streams ship only byte arrays, nothing else. Internally, a symbol should never be in string format anyway, but rather be assigned an int32 or int64 code. Mapping internally is much faster than converting each symbol of each message from a byte array to a string. Also, encoding the symbol in ASCII is very inefficient and blows up the message size.

I doubt sed gets you what you are looking for. Sed is used to parse text, and you mentioned your stream contains byte arrays. You need to know what each field represents from the documentation you mentioned. Then you simply google how to make the conversion to the targeted variable type in, for example, Perl or Python. Really not that hard if you ask me.
–
Matt Wolf, Feb 5 '13 at 18:38

Not sure what you are actually struggling with. You laid out the 40 bytes per message in much detail on SO (I saw your question there). I could write you a program in C# in less than an hour to get the job done. I am assuming here that the source is also a file of a binary nature, correct?
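Because every message has the same width, walking through such a file comes down to reading it in fixed-size chunks. A rough Python sketch, assuming the 40-byte record size mentioned above and no file-level header (both worth verifying against the data); `iter_records` is a made-up helper name and the file name in the comment is a placeholder.

```python
import io

RECORD_SIZE = 40  # bytes per message, as discussed above; verify against the spec

def iter_records(stream, size=RECORD_SIZE):
    """Yield consecutive fixed-width records from a binary stream."""
    while True:
        chunk = stream.read(size)
        if not chunk:
            return  # clean end of file
        if len(chunk) != size:
            # A short final chunk suggests the size or an offset is wrong.
            raise ValueError(f"trailing partial record of {len(chunk)} bytes")
        yield chunk

# Demo on in-memory data so the sketch runs as-is. For a real file, use
# open("imbalances.bin", "rb") instead; that file name is a placeholder.
demo = io.BytesIO(bytes(80))
records = list(iter_records(demo))
print(len(records))  # 2
```

Each yielded chunk can then be decoded field by field. Opening the file in binary mode ("rb") matters on Windows, where text mode would alter some bytes.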
–
Matt Wolf, Feb 5 '13 at 18:45

I'm not sure what you mean by "the source is also a file of binary nature". The problem I'm having (which, yeah, I guess I explained better on SO) is parsing the bytes. I guess, as someone mentioned there, od would help... but I'm on Windows. I also don't have the wherewithal to hire you to write a C# program for me (or the trust to run an executable you send me). I will keep looking around....
–
msh210, Feb 5 '13 at 18:51