Re: When and when not to put information in a database.

I'm in agreement with a normalized database being a good fit for the
problem, in theory. I agree with everything you've said, in fact. My
problem with accepting an RDBMS is the overhead of normalization on any
implementation that I've ever seen.

The "aggressive timeframe" in which the main application must run is about
three hours, but I always like to use as little of this as possible. This
is an inflexible process flow constraint, and so whatever we do has to
perform. Making all 13 million accesses in three hours means about 1200
reads per second. If the database is normalized, we're now looking at 1200
* 10,000, or 12,000,000 rows per second. Can this be done in 2003 in any
RDBMS? I don't know.

My solution, OTOH, is up and running (as of this afternoon) at over 2000
records per second on my P3-500 with 640megs ram, however on a reasonably
small subset of the data (but still 10K prices per record). On a big
multiprocessor machine with ample main memory, it should scream. (If not
then I'll have to eat my words and my wasted time)

But back to the main point, 12 million records per second? The normalized
database strategy is both robust and aesthetically pleasing on paper, but if
it's infeasible, then what? Does this not seem to break the normalized
camel's back? Or are there (RDBMS) solutions/optimizations/tricks that I
don't know about?

Jesse

"Steve Kass" <skass_at_drew.edu> wrote in message
news:b3udqh$g2k$1_at_slb5.atl.mindspring.net...
> Jesse,>> It makes a little more sense. But when you say that price 4249> is not 4249th in sequence, presumably 4249 is still an important> distinguishing factor of price4249, since when you evaluate a> portfolio using price4249 for ABC, you will be using price4249> for DEF, GHI etc. So 4249 has no numerical meaning, but it is> still data (and not metadata).>> In order to be able to associate the ABC price of 54.223> with the JKL price of 13.781 (and use them together> when evaluating), you need to answer the question "which> price for JKL?". It's not enough to say "the one in the same> column as the ABC price of 54.223," since in the relational> model, columns contain metadata, not data, and questions must> be answerable using data. (The question can contain metadata,> of course, as in "what _price_, what _age_, what _date")>> So the number 4249 specifies the particular "simulation instance"> a price belongs to. This is why I want it stored in the table, not> misplaced in a column name.>> I don't recommend it for efficiency, but using GUIDs to answer> "which" by specifying a price list instance would make it harder> to misunderstand the meaning of that information. This is always> a risk when one uses 1, 2, 3 to label things that don't have> respective 1-ness, 2-ness, 3-ness. But integers are handy, and> safe here when you realize this.>> In any case, I do see that you have more data than I originally> thought. It looks to me like you are still doing nothing more than> matrix multiplication (your statistical stuff aside). If you need to> serve ad hoc queries as well as this matrix multiplication, I would> still recommend trying a normalized table with the appropriate> clustered index (Instrument, "simulation_instance) presumably.> Whether another solution is faster, I don't know. If you need> both query-like information and computational information from> the same data, it's not a surprise that no one solution works, and> it may be that the solution you've developed is a good one when> both are needed.>> Again, though, an RDBMS has a great many features beyond> query processing, and if those, such as atomicity, integrity, etc.> are important, be careful if you think you can provide them with> something that wasn't invented yesterday.>> Steve>> Jesse wrote:>> >Steve,> >> >I'll try to explain the data and it's use-cases a little better.> >> >200K instruments, 10,000 simulated prices each, but let me do this with 5

> >instruments and 10 simulated prices:> >> >> >Instrument: price1, price2, ..., price10> >> >ABC: 50.234, 54.223, ..., 49.841> >DEF: 23.401, 22.198, ..., 28.598> >GHI: 34.422, 34.199, ..., 34.841> >JKL: 15.208, 13.781, ..., 14.234> >MNO: 86.942, 81.295, ..., 78.384> >> >Now we have some portfolios, with positions:> >> >Portfolio #> >(Instrument Quantity)> >(Instrument Quantity)> >. . .> >> >> >Portfolio #1:> >ABC 150> >JKL 200> >> >#2:> >GHI 150> >JKL 1000> >MNO 5000> >> >#3:> >ABC 8000> >GHI 1000> >JKL 4000> >> >Etc.> >> >For each portfolio, we grab each instrument with all 10 simulated prices,> >and then do some statistical stuff. (This involves all 10 prices for
each
> >instrument included)> >> >But in my case, I've got ~200,000 instruments, not 5, and 10,000
simulated
> >prices, not 10. Several hundred portfolios with anywhere from 1 or 2, up
to
> >a over a hundred thousand positions each. (Some of these portfolios are> >large firms like Merryl Lynch, Goldman Sachs, etc.) A total of around> >13,000,000 positions, when summed over all portfolios.> >> >In a given day, about 150K of the 200K instruments will end up being> >accessed, and all 10,000 prices are needed for each position. (Which> >instruments will be needed won't be known in advance, so we calculate for> >all instruments)> >> >The data storage should be able to serve ad-hoc queries, as well as daily> >batch jobs. But when you access an instrument, the 4249th simulated
price
> >has no useful meaning out of the context of the other 9999 prices. SQL
on a
> >completely normalized database: "SELECT Price FROM TheoreticalPricesWHERE
> >InstrumentID = ? and Sequence = 4249" So any ad-hoc queries will be> >accessing an instrument with ALL prices, not just one or few. (I could> >explain the why of this, but that would take a while--our quants
explained
> >it to me and it makes perfect sense.)> >> >Does this make it a little more clear?> >> >> >"Steve Kass" <skass_at_drew.edu> wrote in message> >news:b3n5ur$969$1_at_slb3.atl.mindspring.net...> >> >> >>Jesse,> >>> >> If I understand this right, while you have a large volume of data,> >>you need relatively little I/O and arithmetic. You don't say whether> >>you check every simulated move for each portfolio, but even if you> >>don't, 13M positions processed, even assuming no overlap, means you> >>access less than 1% of the data.> >>> >> What am I missing or what haven't you mentioned? - this doesn't> >>sound hard for a normalized table that's keyed on InstrumentID> >>and ScenarioSequence. It sounds like there's a good chance it> >>will parallelize well, too, since the simulated moves sound independent> >>of one another.> >>> >> It sounds like you are basically doing some (large) matrix> >>multiplication, and it might be faster in a traditional array-type> >>data structure, but I'm guessing there are other issues - updates,> >>the frequency with which the scenarios are re-enacted, etc.> >>> >>SK> >>> >>> >>> >>> >> >> >> >>Received on Sun Mar 02 2003 - 21:39:42 CST