I might be on a pretty cool project. It seems that my boss is aware that all the features he's asking for might mean a long, long-term project, and he's setting reasonable milestones, so we might write something truly interesting.

Some back-of-the-envelope calculations lead to 10TB/year of data collection. I'm scared.

Around here, we'd be given insufficient hardware and then told that our implementation sucked because our data dedupe algorithm that was cramming 10TB of compressed data into 5TB of disk space wasn't fast enough on the barely sufficient CPU provided.

Quote:

Some back-of-the-envelope calculations lead to 10TB/year of data collection. I'm scared.

Around here, we'd be given insufficient hardware and then told that our implementation sucked because our data dedupe algorithm that was cramming 10TB of compressed data into 5TB of disk space wasn't fast enough on the barely sufficient CPU provided.

I find this happens a lot in conversations like this:

Buyer: Can product X do Y?
Seller: Well, in the case of A where B is within specified limits and the time allowed is greater than C, yes. Otherwise, no.
Buyer: Look, can X do Y or not? Because if not, then let's stop wasting time.
Seller: If... yes, yes it can.
(some time later)
Buyer: X isn't doing Y, you lied.
Seller: It's not within specified limits.
Buyer: You said it can do it, fix it.

Replace Buyer with Manager and Seller with IT for internal procurement and the story stays the same. I find this more and more as I propose technical changes to a web app: the manager can't follow them, so he simplifies what I'm proposing into a one-liner of "make X happen", drastically changing the actual work scope and complexity. When I try to clarify too much, my boss gets a quick email about how I'm overcomplicating it.

Quote:

Buyer: Can product X do Y?
Seller: Well, in the case of A where B is within specified limits and the time allowed is greater than C, yes. Otherwise, no.
Buyer: Look, can X do Y or not? Because if not, then let's stop wasting time.
Seller: If... yes, yes it can.
(some time later)
Buyer: X isn't doing Y, you lied.
Seller: It's not within specified limits.
Buyer: You said it can do it, fix it.

Replace Buyer with Manager and Seller with IT for internal procurement and the story stays the same. I find this more and more as I propose technical changes to a web app: the manager can't follow them, so he simplifies what I'm proposing into a one-liner of "make X happen", drastically changing the actual work scope and complexity. When I try to clarify too much, my boss gets a quick email about how I'm overcomplicating it.

When I read scenarios like this, it makes me very, very glad that I don't work with simpletons who view the world in binary terms.

We probably go through a lot more than 10TB a year, but we only keep 30 days on disk. It's not that scary. Especially since, if this is new to you, it's unlikely anyone will actually go through the data, which is where things get scary.

Quote:

When I try to clarify too much, my boss gets a quick email about how I'm overcomplicating it.

Does your boss reply with "You're oversimplifying it"?

Usually. The conversation can be a failure on both sides: the customer for simplifying too much, or me for not explaining in terms they can relate to with business logic. My boss has a longer relationship with them and better insight into what they'd understand, so he'll come to me and get my explanation, work out a less overly technical explanation, and go back to them to explain it again and why they need to know it.

Quote:

We probably go through a lot more than 10TB a year, but we only keep 30 days on disk. It's not that scary. Especially since, if this is new to you, it's unlikely anyone will actually go through the data, which is where things get scary.

Indeed. We're into the thousands of terabytes now, and even a very very minor problem can lead to a staggering amount of data being missing, unavailable, corrupt, etc.

We very deliberately read all of it as a sanity-checking exercise. If you haven't checked your captured data, assume it's broken; it normally is.

We have systems that read it for us - for reports, ticketing, etc. But of course, that means little when the system raises some alert that requires a human being to look at what are typically barely-machine-readable logs. Then there's the other aspect: you can't read what's not there. You also cannot detect that something is not there. You can detect that a LOT of things are not there, but that one thing? Ha! Have fun!

The third good one, and given this is the programmer forum it may be the one most likely to be our problem, is when you have a system that reads the logs, and that system breaks. Often silently. Because when it gets noisy, people complain, so they told you to shut it up unless there's a real problem. Prepare to be blamed for the missing data. And for silencing that shit years ago and not telling anyone.

Aesop's tale of the little program that spammed wolf is a modern classic.

Yeah, but I was talking about searching for the needle that doesn't exist in the needlestack. Both are significant problems.

Quote:

If your data really matters to you, deal with it yourself. Third-party abstractions have never had the same quality as doing it in house.

I guess. But whether you deal with it yourself or use someone else's software, you are still relying on software to do a job for you. And we all know how fallible programmers are, ourselves included. Like you said, you have to do continual testing and manual checks to assure yourself that data is being parsed correctly. But when dealing with large volumes, there's only so much you can do - would a system really detect a gig of random logs missing out of a 10TB/day log, and how likely is it that a manual check would come across it? I don't see the source of the analysis really affecting that chance.

Missing data we know we currently care about? Yes, we'd spot that sort of gap (aggregate numbers would go wrong).

It's spotting data we don't currently know we need that's tricky. For that there's nothing like capturing the raw data. In full.

It helps that the world has got better in the interim. Data streams without sequence numbers of some form are practically unheard of these days.

The interesting bit now is detecting things wrong with the data *despite it being a completely valid capture of the source*, i.e. the root provider is wrong/has an awful bug. That needs someone working out whether everything is internally consistent, with lots of heuristic pruning of the search space. That's quite hard; if anyone is into that and wants a job (in London), we're hiring.
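Not this poster's actual pipeline, just a minimal sketch (in Python) of the kind of gap detection that sequence numbers make possible; the `seq` field name and the increment-by-one assumption are invented:

```python
def find_gaps(records, seq_key="seq"):
    """Yield (expected, got) wherever the sequence number jumps.

    Assumes in-order delivery and sequence numbers that increment by 1;
    real feeds may wrap, restart, or interleave multiple sequences.
    """
    expected = None
    for record in records:
        seq = record[seq_key]
        if expected is not None and seq != expected:
            yield (expected, seq)
        expected = seq + 1

# Records 3 and 4 never arrived, and we can say so precisely.
stream = [{"seq": 1}, {"seq": 2}, {"seq": 5}, {"seq": 6}]
for expected, got in find_gaps(stream):
    print(f"gap: expected {expected}, got {got}")  # gap: expected 3, got 5
```

That precision is the whole point: without sequence numbers you can only say "the totals look low", not "these exact records are missing".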

Quote:

Some back-of-the-envelope calculations lead to 10TB/year of data collection. I'm scared.

Around here, we'd be given insufficient hardware and then told that our implementation sucked because our data dedupe algorithm that was cramming 10TB of compressed data into 5TB of disk space wasn't fast enough on the barely sufficient CPU provided.

I find this happens a lot in conversations like this:

Buyer: Can product X do Y?
Seller: Well, in the case of A where B is within specified limits and the time allowed is greater than C, yes. Otherwise, no.
Buyer: Look, can X do Y or not? Because if not, then let's stop wasting time.
Seller: If... yes, yes it can.
(some time later)
Buyer: X isn't doing Y, you lied.
Seller: It's not within specified limits.
Buyer: You said it can do it, fix it.

Replace Buyer with Manager and Seller with IT for internal procurement and the story stays the same.

(emphasis mine). The easiest solution here is to say, "No" right then and there. All solutions have constraints of some sort, and building a system where the client doesn't understand that is a recipe for disaster. Frequently, the easiest way to get the dialog to move forward is to attach costs to what you can do, and what they want you to do. Giving them a choice (even a false one) can help substantially.

Quote:

I find this more and more as I propose technical changes to a web app: the manager can't follow them, so he simplifies what I'm proposing into a one-liner of "make X happen", drastically changing the actual work scope and complexity. When I try to clarify too much, my boss gets a quick email about how I'm overcomplicating it.

Generally, once you have a user need, you probably (unfortunately) shouldn't involve the user in turning those needs into actionable technical requirements. This goes both ways: you shouldn't let them provide undue input, but you also shouldn't solicit input or provide explanations in technical terms.

I suspect the manager needs an attitude adjustment and you could use additional help in explaining things in terms that work for them. I wouldn't even deal with the manager without passing everything through your boss (shouldn't he be taking point anyway?). Understanding user needs (and phrasing things in their terms) is difficult, and you shouldn't ever feel bad about asking for help in that regard. In fact, you should make it a priority to seek out whatever expertise you can, before responding at all.

Quote:

Like you said, you have to do continual testing and manual checks to assure yourself that data is being parsed correctly. But when dealing with large volumes, there's only so much you can do - would a system really detect a gig of random logs missing out of a 10TB/day log, and how likely is it that a manual check would come across it? I don't see the source of the analysis really affecting that chance.

The fundamental problem is considering size relevant. The problem still exists if it's 1MB missing from 10GB of logs, or even 100K from 1GB, from a statistics and even a human perspective.

Such analysis is often tricky because although you can build a good history of data to tell you what you ought to see, the natural variance is often quite large, so small amounts of missing data are in the noise. Moreover, if you're removed from the process generating the logs, then it's hard to know when it should or shouldn't be logging (e.g., if the software is intentionally shut down for 4 hours, then you should be very surprised if you get logs during that time!).
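To make that concrete with entirely made-up numbers, here's a crude mean ± k·stddev baseline check in Python; note how a 1% loss sits comfortably inside the natural hour-to-hour variance:

```python
import statistics

def is_anomalous(count, history, k=3.0):
    """Flag an hourly log count outside mean +/- k*stddev of past counts."""
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    return abs(count - mean) > k * stdev

# Hypothetical hourly counts with a few percent of natural variance.
history = [100_000, 92_000, 108_000, 97_000, 103_000, 95_000, 105_000]
print(is_anomalous(99_000, history))  # False: a 1% loss hides in the noise
print(is_anomalous(50_000, history))  # True: only a huge gap stands out
```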

That said, most of the problems I see with logs in particular fall into two categories:

The software used to move them from place to place is not particularly robust. Frequently, this is a failing of the tools available.

The software parsing them isn't robust to unexpected or unusual log input (e.g., extremely long lines or output that's never been seen before).
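Nothing from the thread, just a sketch of what "robust to unusual input" can mean in practice: cap line length and count junk instead of crashing on it. The key=value format and the 64KB cap are invented for illustration.

```python
MAX_LINE = 64 * 1024  # assumed cap; tune to whatever your feed can produce

def parse_lines(stream):
    """Parse 'key=value' log lines, counting junk rather than dying on it."""
    bad = 0
    for line in stream:
        if len(line) > MAX_LINE:
            bad += 1  # extremely long line: count it, don't choke on it
            continue
        try:
            yield dict(part.split("=", 1) for part in line.split())
        except ValueError:
            bad += 1  # output we've never seen before: counted, not fatal
    if bad:
        print(f"warning: {bad} unparseable line(s)")  # surface it, don't silence it

for record in parse_lines(["ts=1 msg=ok", "total garbage", "ts=2 msg=ok"]):
    print(record)
```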

One effective solution for both is to insert known mark records, through some means, at the source. If you don't get the number of mark records you're expecting, you know something broke.
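A minimal Python sketch of that mark-record check, assuming something upstream injects a recognizable "MARK" line once a minute (the prefix and the rate are both invented details):

```python
def check_marks(lines, window_minutes=60, marks_per_minute=1, prefix="MARK"):
    """Compare the mark records actually present against the number expected."""
    expected = window_minutes * marks_per_minute
    got = sum(1 for line in lines if line.startswith(prefix))
    if got < expected:
        print(f"lost data? expected {expected} marks, found {got}")
    return got

# An hour's capture holding only 57 of the 60 expected marks.
capture = ["MARK 00:%02d:00" % m for m in range(57)] + ["ordinary log line"]
check_marks(capture)  # lost data? expected 60 marks, found 57
```

The nice property is that it detects absence: you aren't hoping to notice missing lines, you're counting lines you know must exist.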

Quote:

It's spotting data we don't currently know we need that's tricky. For that there's nothing like capturing the raw data. In full.

I've said this before in other contexts, but it bears repeating: there is no substitute for the raw data. You should always design systems around the notion that the raw data is always authoritative and that, no matter what you do, other people will need access to it. Likewise, you should keep it for as long as feasible.

Quote:

It helps that the world has got better in the interim. Data streams without sequence numbers of some form are practically unheard of these days.

Sadly, sometimes, as palad1 alludes to, you cannot get the raw stream (except at a cost even my company would balk at). There are vendors that 'notionally' send the raw *data*, but their network config is ballsed up so badly that they drop/reorder packets internally. They're fun.

Even the raw stream isn't authoritative for indicating what actually happened. They have bugs too.

Well, if you can't get the source data, you do the best you can with the closest thing you can get. I don't think that changes the advice any, though it does make life pretty frustrating. Likewise, data corruption and loss are real things that happen, so they don't change my advice either.

Quote:

It helps that the world has got better in the interim. Data streams without sequence numbers of some form are practically unheard of these days.

I'm not sure what data you deal with, but that is very much the opposite of my world. Hell, sometimes we're lucky just to get timestamps - and if we do, sometimes it's the syslog system putting it on there, not the generating system, which leads to all sorts of issues.

Quote:

Such analysis is often tricky because although you can build a good history of data to tell you what you ought to see, the natural variance is often quite large, so small amounts of missing data are in the noise. Moreover, if you're removed from the process generating the logs, then it's hard to know when it should or shouldn't be logging (e.g., if the software is intentionally shut down for 4 hours, then you should be very surprised if you get logs during that time!).

I envy those who do not have such natural variance. Even on a good day, our hour-to-hour variance is huge, and bears little resemblance to yester(day|week|month|year)-at-this-hour numbers. Then again, if everything were perfectly consistent, I'm not sure they'd need us, and I'd be out of a job.

Quote:

The fundamental problem is considering size relevant. The problem still exists if it's 1MB missing from 10GB of logs, or even 100K from 1GB, from a statistics and even a human perspective.

Yes, raw numbers are immaterial to an extent (except when humans are involved - more data needs more eyeballs/time). Percentages are still relevant. A better way to state it would be: would your system know if it were missing 0.01% of the logs, or even 1%? When combined with large amounts of variance, it is a very difficult problem to solve.

The best way we have to overcome this is active monitoring, where status is polled. That's not always possible for certain stages, but it at least provides the "is it still alive?" answer when logs are silent or less noisy. It still does not answer whether all the logs made it through, but it is often indicative of such issues.
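As a rough Python sketch of the polling idea, with invented per-stage health checks standing in for whatever a real pipeline exposes (an HTTP ping, a PID check, a queue depth):

```python
import time

def poll(stages, interval_seconds=60, rounds=3):
    """Ask each pipeline stage 'are you alive?' instead of waiting for logs."""
    for _ in range(rounds):
        for name, is_alive in stages.items():
            if not is_alive():  # hypothetical per-stage health check
                print(f"{name}: not responding - logs may be silent, not quiet")
        time.sleep(interval_seconds)

# Toy stand-ins: the collector answers, the parser has wedged silently.
stages = {
    "collector": lambda: True,
    "parser": lambda: False,
}
poll(stages, interval_seconds=0, rounds=1)  # parser: not responding - ...
```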