On Sat, May 7, 2011 at 10:20 AM, Steven Rostedt <rostedt@goodmis.org> wrote:> On Sat, 2011-05-07 at 16:44 +0200, Ingo Molnar wrote:>> <snip>>>> * Steven Rostedt <rostedt@goodmis.org> wrote:>> > Here's the choices then:>> >>> > 1) we get libparsevent.so out into the world and all tools can use it, and>> > the raw formats of the trace events will no longer be an issue as long as the>> > names of events and fields stay the same.>>>> Firstly, such an event parser already exists in>> tools/perf/util/parse-events.[ch], so if you want to librarize it please talk>> to Arnaldo to create tools/perf/lib/ and a libperf.so.>> The pares-events.[ch] file in perf just parses the command line options> to denote what events need to be recorded. The real work lies in> trace-event-parse.c that does the real parsing, and that file was copied> from libparsevent.so. I purposely wrote that as a library that perf> could use (again, being open to other ideologies), but instead of using> the library, the code was hard coded into perf, which guaranteed the> forking of this code.

A small note that this has also created some minor frustration for me,as I had to send patches to extend this parsing to two different treesand maintainers. (I don't think the perf patch ever was applied... Ishould follow up on that.)

I believe it's been suggested that trace-cmd should be part of thekernel tree, just like the perf tool. This would be nice, and wouldmore easily allow them to share this parsing code. It would also givethem a common place to work on their unification.

>> <snip>>>> Secondly, you are solving the wrong problem and you are not seeing the real>> problems. We can keep and we *will* keep ABIs, it's not hard. 4 bytes padding>> is not an issue and it never was for PowerTop nor for any other real person who>> relies on tracing.>> I've Cc'd the Google folks that are very interested in tracing, to let> them respond to that comment.

Thanks for raising it to my attention.

The size of events is a *huge* issue for us. Please look at thepatches we have been sending out for tracing: A lot of them are aboutreducing the size of events. Most of the patches we carry internallyare about reducing the size of events. Memory is the most scarceresource on our systems, so we *cannot* afford to use large tracebuffers. This means that with a 8MB/cpu buffer (this is about what wecan hope to allocate on a heavily loaded system), we can only get onthe order of 10 seconds of trace data at best when tracingsystemcalls, irqs, and sched_switch. This is not enough when we don'tknow what exactly we're looking for.

We have gone so far as to add an entire second set of "_tiny" eventsfor syscalls that records only 16 bits of arg0, and for sched_switchthat records only next_pid. These alone have roughly doubled the traceperiod, and it is still not enough.

I really can't stress enough how big an issue the size of events isfor us. It is our number one issue with tracing.

>> Do you think that "other real person"s are only kernel developers or> desktop users that are not using production systems?>> And it's not just 4 bytes, its the entire useless header. Who cares> about preempt counts? I examine that field only 1% of the time. In most> other cases its totally useless. Same with interrupt flags, although I> do look at them more often than preempt count. We (Frederic and I) still> want to get rid of the pid for every event.

Internally we have dropped all but event type and pid (and changed pidto 16 bits), and we have plans and patches in development to drop pid.

>>>>> As i see it the problem is the thought-less ftrace churn and the fragility of>> how TRACE_EVENT() can be changed.>> Now you are just insulting me. There has not been any "thought-less"> churn.>> I *designed* TRACE_EVENT() to be changed. That's why I wrote all that> code to export the event formats. If you think all raw data of events> are to be an ABI, then lets rip out all the event formats and make> everything hard-coded.

We have tools that rely on TRACE_EVENT formats being exported. Thiswas a factor in our choice to use ftrace, and I consider parsing theformats to be part of the ABI.

It may be true that some fields of some events should not changeincompatibly, but having the formats exported allows a wide definitionof "compatible": mostly that the name should not change. Even changingthe width of an integer field works perfectly well.

>> I'm sorry, but that mentality seems to encourage thoughtless churn.>>>>> Really, ftrace and in particular you are showing a huge disconnect and i'm>> increasingly unhappy about it. Look at this very thread: you fought tooth and>> nail to even *acknowledge* that there is a problem...>> I agree there is a problem, but what each of us think the problem is, is> different. I say there's a problem with tools depending on a layout of> raw data that is internal to the kernel, especially when it was designed> to allow robustness. If we make it easy for tools to extract the data> properly, then there should not be any issues if the raw format changes.

Agreed. It also allows forward compatibility when new events or newfields are added, or when an event changes.

I see tracing as primarily a debugging tool: It is about inspectingkernel internals. You cannot expect kernel internals to change, andnot expect something that inspects those internals not to change theformat of the data it exports. Kernel variables, structures andparameters will change, disappear, or become meaningless or useless(eg, lock_depth); and they are supposed to as much as the kernel issupposed to change improve. Luckily (or actually, by design), we havea way to cope with this: the event formats are exported for tools toread.

I think ftrace has an abi, although I'm not sure how recorded it is.In my view it includes:- the debugfs control files (events dir, buffer_size_kb, options,set_event, etc)- the format of the ring buffer pages and ring buffer event headers- the format and meaning of the event format files.- For a few select trace events, yet to be enumerated, the presence,name, and wide-sense field type (integer, array, dyn. array) of someselect fields (eg "next_pid" in "sched_switch").

In my view it explicitly does *not* include:- the exact content of the event format files, except as noted above.

>> Linus said:>>> If you made an interface that can be used without parsing the>> interface description, then we're stuck with the interface. Theory>> simply doesn't matter.>>>> You could help fix the tools, and try to avoid the compatibility>> issues that way. There aren't that many of them.>> To me this seems that we have a way to have the tools do the right> thing. If a library can be used that allows a more robust interface,> then why not use it? The library already exists, I talked to Arjan, and> he's willing to use it. I'm willing to put the effort in fixing powerTop> and pushing this library to distributions. What's the problem?>> You are entering a very dangerous precedence by stating that the raw> format is now the ABI, end of story. This will bite us in the future. It> just did, and it will just get worse.>>>>> As things look like from my side it appears you want to keep ftrace a messy,>> forked project with no regard to perf based tooling and this will fragment>> Linux instrumentation, the many technical disadvantages be damned.>> ftrace is not a fork and never was. To be a fork, we need a common> ancestor. Ftrace and perf do not have that. Perf was created (after> Ftrace) to profile events, and did so very well. It just happened to> expand into the tracing area, where you want me to abandon all my ftrace> work and rewrite it on top of something that was not designed to do> tracing.>> Now that perf has entered the tracing field, I would be happy to bring> the two together. But we disagree on how to do that. I will not drop> ftrace totally just to work on perf. There's too many users of ftrace> that want enhancements, and I will still support that. The reason being> is that I honestly do not believe that perf can do what these users want> anytime in the near future (if at all). I will not abandon a successful> project just because you feel that it is a fork.

We have invested heavily in using ftrace. We chose to use ftracebecause it was maintained upstream, fast, simple to use, and had allthe features we were already relying on from ktrace, and then some.When we last measured (admittedly quite a while ago now), perf hadabout 5x slower write latency than ftrace. This was probably thebiggest thing that stopped us from considering it.

Perf might improve its tracing story (I'm sure it already has, butwe've been playing ostrich a bit in order to get work done), and I'malso in agreement that bringing them closer together is a Good Thing.But if ftrace simply disappears, that would create a lot of work forus. We are every day depending on ftrace and infrastructure built ontop of ftrace more and more to make Google faster. We can cope withincremental improvements. Wholesale ripping would force us to forkthis part of the kernel, as we have too much invested in ftrace.