git-grep while excluding files in a blacklist

git-grep while excluding files in a blacklist

Does git-grep allow searching for a pattern in all files *except*
files matching a pattern. E.g. in our project we have multiple DLL's
in git, but when searching I would like to exclude these for speed. Is
that possible with git-grep?

Re: git-grep while excluding files in a blacklist

On Tue, Jan 17, 2012 at 4:14 PM, Dov Grobgeld <[hidden email]> wrote:
> Does git-grep allow searching for a pattern in all files *except*
> files matching a pattern. E.g. in our project we have multiple DLL's
> in git, but when searching I would like to exclude these for speed. Is
> that possible with git-grep?

Not from command line, no. You can put "*.dll" to .gitignore file then
"git grep --exclude-standard".
--
Duy
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [hidden email]More majordomo info at http://vger.kernel.org/majordomo-info.html

Re: git-grep while excluding files in a blacklist

> On Tue, Jan 17, 2012 at 4:14 PM, Dov Grobgeld <[hidden email]> wrote:
>> Does git-grep allow searching for a pattern in all files *except*
>> files matching a pattern. E.g. in our project we have multiple DLL's
>> in git, but when searching I would like to exclude these for speed. Is
>> that possible with git-grep?
>
> Not from command line, no. You can put "*.dll" to .gitignore file then
> "git grep --exclude-standard".

Re: git-grep while excluding files in a blacklist

> Nguyen Thai Ngoc Duy <[hidden email]> writes:
>
>> On Tue, Jan 17, 2012 at 4:14 PM, Dov Grobgeld <[hidden email]> wrote:
>>> Does git-grep allow searching for a pattern in all files *except*
>>> files matching a pattern. E.g. in our project we have multiple DLL's
>>> in git, but when searching I would like to exclude these for speed. Is
>>> that possible with git-grep?
>>
>> Not from command line, no. You can put "*.dll" to .gitignore file then
>> "git grep --exclude-standard".
>
> No rush, but is this something we would eventually want to handle with the
> negative pathspec?

Definitely. But because I'm stuck at adding "seen" feature from
match_pathspec_depth to tree_entry_interesting, that probably won't
happen this year. Adding "--exclude=<pattern>" to git-grep is a more
plausible option.
--
Duy
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [hidden email]More majordomo info at http://vger.kernel.org/majordomo-info.html

I think the .gitattributes mechanism is a better place to configure this, as the
files I with to exclude from grep are always the same files (test fixtures
mainly, minified library code also). I often want to make git-diff be quieter
about them too, so I'll be setting the -diff attribute already.

I've attached a patch, which adds the "grep" attribute, which can (for now) only
be unset. This has the same effect for me as my original patch [1], but should
also improve life for people with large blobs as described above.

In future we could extend the attribute to give a meaning to set values, but I'm
not yet sure what that would look like. We could also add an --exclude=<foo>
flag to grep for people who want to configure grep on a per-run basis, but I
think that is a much less common desire.

The failing test attached to this patch is a symptom of a larger issue with the
way that git-grep handles objects that are not at the root of the repository. A
more obvious symptom can be revealed by comparing the output of:

git grep int HEAD:./builtin

cd builtin; git grep int HEAD:./

The problem is that grep doesn't correctly separate the path from the revision
part of the spec. It's currently unobvious to me how to fix this but I hope
someone more familiar with the code (Nguyen or Junio) might be able to see a
way.

To set -grep on an file in .gitattributes will cause that file to be
skipped completely while grepping. This can be used to reduce the number
of false positives when your repository contains files that are
uninteresting to you, for example test fixtures, dlls or minified source
code.

The other approach considered was to allow an --exclude flag to grep at runtime,
however that better serves the less common use-case of wanting to customise the
list of files per-invocation rather than statically.

+ATTRIBUTES
+----------
+
+grep::
+ If the grep attribute is unset on a file using the linkgit:gitattributes[1]
+ mechanism, then that file will not be searched.
+
Examples
--------

diff --git a/Documentation/gitattributes.txt b/Documentation/gitattributes.txt
index a85b187..3ffcec7 100644
--- a/Documentation/gitattributes.txt
+++ b/Documentation/gitattributes.txt
@@ -869,6 +869,15 @@ If this attribute is not set or has an invalid value, the value of the
`gui.encoding` configuration variable is used instead
(See linkgit:git-config[1]).

+Configuring files to search
+~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+`grep`
+^^^^^^
+
+If the attribute `grep` is unset for a file then linkgit:git-grep[1]
+will ignore that file while searching for matches.
+

Re: [PATCH] Don't search files with an unset "grep" attribute

> ---8<---
>
> To set -grep on an file in .gitattributes will cause that file to be
> skipped completely while grepping. This can be used to reduce the number
> of false positives when your repository contains files that are
> uninteresting to you, for example test fixtures, dlls or minified source
> code.

Please reword this to describe the problem being solved first (why it
needs to be solved, in what situation you cannot live without the
feature), and then describe the approach you used to solve it.

in order to avoid uselessly spewing garbage at you while reminding you
that the command is not showing the whole truth and you _might_ want to
look into the elided file if you wanted to with "grep -a world hello.o".
Compared to this, it feels that the result your patch goes too far and
hides the truth a bit too much to my taste. Maybe it is just me.

Perhaps you, or all participants of your particular project, usually do
not want to see any grep hits from minified.js, but you may still want to
be able to say "I want to dig deeper and make sure I have copyright
strings in all files", for example. It is unclear how you envision to
support such a use case building on top of this patch.

Your "attributes only" is not an acceptable solution in the longer run,
even though it is a good first step (i.e. "attributes first and other
enhancement later"). There should be an easy way to get the best of both
worlds.

> The other approach considered was to allow an --exclude flag to grep at
> runtime, however that better serves the less common use-case of wanting
> to customise the list of files per-invocation rather than statically.

I doubt that it is justifiable to call per-invocation "the less common".

By the way, if the uninteresting ones are dll and minified.js, I wonder
why it is insufficient to mark them binary, i.e. uninteresting for the
purpose of all textual operations not just grep but also for diff.

I am *not* going to ask why they are treated as source and tracked by git
to begin with.

[PATCH] Don't search files with an unset "grep" attribute

Junio C Hamano <gitster <at> pobox.com> writes:
> Please reword this to describe the problem being solved first (why it
> needs to be solved, in what situation you cannot live without the
> feature), and then describe the approach you used to solve it.
>

Done — I also removed the extraneous braces from the patch.

>
> in order to avoid uselessly spewing garbage at you while reminding you
> that the command is not showing the whole truth and you _might_ want to
> look into the elided file if you wanted to with "grep -a world hello.o".
> Compared to this, it feels that the result your patch goes too far and
> hides the truth a bit too much to my taste. Maybe it is just me.

I used to use this approach, hooking into the "diff" attribute directly to mark
a file as binary, however that was clearly a hack. When developing this patch I
went through a few iterations, one in which -grep meant "treat this file as
binary", however explaining that in the man page is subtle and ugly: "HINT: you
might want to set a file as binary because you don't want to see results from it
when grepping". It's much more obvious to have -grep mean "don't show me
results".

A nicer alternative could be to allow "grep=binary" (and for completeness
"grep=text") in addition to "-grep". Then people who want to see matches but not
the contents of the matches can tell grep that their files are "binary". It
would also make sense to add "grep=binary" to the binary macro-attribute. We
could even extend the system arbitrarily to allow something like the textconv
attributes of git-diff... one step at a time is probably better though.

> Perhaps you, or all participants of your particular project, usually do
> not want to see any grep hits from minified.js, but you may still want to
> be able to say "I want to dig deeper and make sure I have copyright
> strings in all files", for example. It is unclear how you envision to
> support such a use case building on top of this patch.

I think that it would be very reasonable to add a flag to grep to tell it to
ignore the attribute temporarily (like --no-textconv on git-diff) and update the
"-a" shorthand to imply "--text --no-exclude-attribute".

Yours,
Conrad

---8<---

Git grep is used by developers to search the code stored in their repositories,
however it can give noisy results when the repository contains files that are
not of direct interest to development. Examples of such files include test
fixtures, dlls, or minified source code.

To help these developers search efficiently, git grep will now use the
gitattributes mechanism to ignore all files with an unset "grep" attribute.

Another approach considered was to allow an --exclude flag to grep at runtime,
however this is more clunky to use when the set of files to be excluded is
fixed.

+ATTRIBUTES
+----------
+
+grep::
+ If the grep attribute is unset on a file using the linkgit:gitattributes[1]
+ mechanism, then that file will not be searched.
+
Examples
--------

diff --git a/Documentation/gitattributes.txt b/Documentation/gitattributes.txt
index a85b187..3ffcec7 100644
--- a/Documentation/gitattributes.txt
+++ b/Documentation/gitattributes.txt
@@ -869,6 +869,15 @@ If this attribute is not set or has an invalid value, the value of the
`gui.encoding` configuration variable is used instead
(See linkgit:git-config[1]).

+Configuring files to search
+~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+`grep`
+^^^^^^
+
+If the attribute `grep` is unset for a file then linkgit:git-grep[1]
+will ignore that file while searching for matches.
+

Re: [PATCH] Don't search files with an unset "grep" attribute

>> in order to avoid uselessly spewing garbage at you while reminding you
>> that the command is not showing the whole truth and you _might_ want to
>> look into the elided file if you wanted to with "grep -a world hello.o".
>> Compared to this, it feels that the result your patch goes too far and
>> hides the truth a bit too much to my taste. Maybe it is just me.
>
> I used to use this approach, hooking into the "diff" attribute directly to mark
> a file as binary, however that was clearly a hack.

After thinking about this a bit more, I have to say I disagree that it is
a hack.

I agree that the issue you are trying to address is real, but I think the
approach taken by the patch is flawed on three counts:

- You cannot do "grep -a" equivalent once you set your "-grep" attribute;

- The user has flexibility to set "diff" and "grep" independently, which
is an unnecessary complication [*1*]; and

- The user does not get any hint that potential hits are elided, like
normal "grep" would do.

I also think we can fix the above problems, while simplifying the external
interface visible to the end users.

So let's step back a bit and take a look at the handling of files for
which you do not want to see patch output and/or you do not want to see
grep hits, in a fictional but realisitic use scenario.

I am building a small embedded applicance that links with a binary blob
firmware, xyzzy.so, supplied by an upstream vendor, and this blob is
updated from time to time. In order to keep track of what binary blob went
into which product release, this binary blob is stored in the repository
together with the source files. Because it is useless to view binary
changes in "git log -p" output, I would naturally mark this as "binary".

Now, is it realistic to expect that I might want to see "grep" hits from
this binary blob? I do not think so [*2*].

When I say "xyzzy.so is a binary" in my .gitattributes file, according to
the current documentation, I am saying that "do not run diff, and also do
not run EOL conversion on this file" [*3*], but from the end user's point
of view, it is natural that I am entitled to expect much more than that.
Without having to know how many different kinds of textual operations the
SCM I happen to be using supports, I am telling it that I do not want to
see _any_ textual operations done on it. If a new version of "git grep"
learns to honor the attributes system, I want my "this is binary" to be
considered by the improved "git grep".

If you think about it this way from the very high level description of the
problem, aka, end user's point of view, it is fairly clear that tying the
"binary" attribute to "git grep" to allow us to override the built-in
buffer_is_binary() call you see in grep.c gives the most intuitive result,
without forcing the user to do anything more than they are already doing.

Because "binary" decomposes to "-diff" and "-text", we have two choices.
We _could_ say "-text" should be the one that overrides buffer_is_binary()
autodetection, but using "-diff" instead for that purpose opens the door
for us to give our users even more intuitive and consistent experience.

Suppose that this binary blob firmware came with an API manual formatted
in PDF, xyzzy.pdf, also supplied by the vendor. It is also kept in the
repository, but again, running textual diff between updated versions of
the PDF documentation would not be very useful. I however may have a
textconv filter defined for it to grab the text by running pdf2ascii.

Now if my "git show --textconv xyzzy.pdf" has an output line that says a
string "XYZZY API 2.0" was added to the current version, wouldn't it be
natural for me to expect that "git grep --textconv 'XYZZY API' xyzzy.pdf"
to find it [*4*]?

As an added bonus, if you truly want to omit _any_ hit from "git grep" for
your minified.js files, you can easily do so. Just define a textconv
filter that yields an empty string for them, and "grep --textconv" won't
produce any matches, not even "Binary file ... matches".

So I am inclined to think that reusing the infrastructure we already have
for the `diff` attribute as much as possible would be a better design for
this new feature you are proposing.

The name of the attribute `diff` is somewhat unfortunate, but the end user
interface to the feature at the highest level is already called "binary"
attribute (macro), and our low-level documentation can give precise
meaning of the attribute while alluding to the origin of the name so that
intelligent readers would understand it pretty easily (see attached
patch).

Besides, we already tell the users in "Marking files as binary" section to
mark them with "-diff" in the attributes file if they want their contents
treated as binary.

[Footnotes]

*1* An unused flexibility like this does nothing useful, other than
forcing the user to flip two attributes always as a pair, when one should
suffice.

*2* The essence of my previous message was "why it is insufficient to mark
them binary, i.e. uninteresting for the purpose of all textual operations
not just grep but also for diff." which was unanswered.

*3* This is because "binary" is defined as an attribute macro that expands
to "-diff -text".

*4* This also suggests that the infrastructure we already have for driving
"git diff" is a good match for "git grep". The "funcname" patterns, which
"git grep -p" already can borrow and use from "diff" infrastructure, is
another indication that using `diff` is a good match for "git grep".

-The attribute `diff` affects how 'git' generates diffs for particular
-files. It can tell git whether to generate a textual patch for the path
-or to treat the path as a binary file. It can also affect what line is
+The attribute `diff` affects if `git` treats the contents of file as text
+or binary. Historically, `git diff` and its friends were the first to use
+this attribute, hence its name, but the attribute governs textual
+operations in general, including `git grep`.
+
+For `git diff` and its friends, differences in contents marked as text
+produces a textual patch, while differences in non-text are reported only
+as "Binary files differ". For `git grep`, matches in contents marked as
+text are reported by showing lines that contain them, while matches in
+non-text are reported only as "Binary file ... matches".
+
+This attribute can also affect what line is
shown on the hunk header `@@ -k,l +n,m @@` line, tell git to use an
external command to generate the diff, or ask git to convert binary
files to a text format before generating the diff.
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [hidden email]More majordomo info at http://vger.kernel.org/majordomo-info.html

Re: [PATCH] Don't search files with an unset "grep" attribute

On Mon, Jan 23, 2012 at 10:59:45PM -0800, Junio C Hamano wrote:

> Conrad Irwin <[hidden email]> writes:
> > I used to use this approach, hooking into the "diff" attribute directly to mark
> > a file as binary, however that was clearly a hack.
>
> After thinking about this a bit more, I have to say I disagree that it is
> a hack.

I kind of agree.

The biggest problem is that the name is wrong. The "diff.*.command"
option really is about generating a diff between two blobs of a certain
type. But "diff.*.textconv" and "diff.*.binary" are really just
attributes of the file, and may or may not have to do with generating a
diff. Ditto for diff.*.funcname, I think.

You argue, and I agree, that if we are talking about attributes of the
files and not diff-specific things, then other parts of git can and
should make use of that information.

I think it would be a no-brainer that those type attributes should apply
to "git grep".

So maybe the first step on this path would be to introduce something
like "filetype" as a new attribute, have "diff" respect its settings,
and recommend people set up filetypes as appropriate. Or maybe that just
makes things more confusing in the long run, and we are better off
simply accepting that the name is slightly misleading. But either way,
it seems clear that git should be respecting gitattributes at the very
least to mark files as binary (and I think we already use funcname
patterns; textconv is a slightly stickier subject, so I'll address that
below).

But what I'm not sure I agree with is that the idea of "I don't want to
include path X in my grep" maps to "just mark the file as binary". For
example, in git we carry around a lot of code in compat/ that comes from
other places. I generally don't want to see grep results from
compat/nedmalloc/, because that isn't git code.

But should I mark everything in compat/nedmalloc as binary? I don't
think so. I _do_ want to see changes in nedmalloc in "git log" or "git
diff". They don't bother me because they're infrequent, and we still
want to produce regular text patches for the list when they come up. But
because nedmalloc contains a lot of lines of code (even though they
don't change a lot), it happens to produce a lot of uninteresting
matches when grepping.

> - The user has flexibility to set "diff" and "grep" independently, which
> is an unnecessary complication [*1*]; and

In the nedmalloc case, if we are to have "grep" and "diff" attributes
that behave similarly, you potentially do want to set them differently.
It would be nice to be able to treat them differently in the cases you
wanted to, but not _have_ to do so. Attribute macros can almost
implement this. You could add "-grep" to binary. But you can't (as far
as I know) do "macro=foo" to handle arbitrary diff drivers. I suspect we
could extend the rules to allow macros that take an argument and pass it
to their constituent attributes.

However, I think this is the wrong road to go down. You would want
macros like this _if_ you have grep and diff attributes that basically
do the same thing (e.g., marking a file as binary for diff versus binary
for grep). But I think that's a wrong road to go down. More likely a
file is binary or it is not binary, and the problem is conflating "file
is binary" with "I do not usually want to grep this file".

I'd much rather see grep inherit diff's file attributes unconditionally
(whether we still call them "diff" or not), and add a grep attribute
that is about "usually this is worth grepping". And then have a
tri-state command-line option for grep to either include uninteresting
things, exclude them, or to give terse output for them (mention that
there were matches in foo.c, but not each one). Probably defaulting to
terse output.

That makes the case you presented work out of the box: things marked as
binary for diffing look binary to grep, and we give the usual terse
"binary file matches" output. The user doesn't have to do anything. For
more complex cases like nedmalloc, you can still achieve the "this is
text, but it's usually boring" behavior. And if you really want to do a
thorough grep, you can just "git grep --exclude=none".

> So let's step back a bit and take a look at the handling of files for
> which you do not want to see patch output and/or you do not want to see
> grep hits, in a fictional but realisitic use scenario.
> [...]

I think this is a nice user story, and it fits in with your suggested
git behavior. But I think there are other stories, too, like the
nedmalloc one. And it would be nice to make them work without hurting
the simplicity of the case you mentioned.

> If you think about it this way from the very high level description of the
> problem, aka, end user's point of view, it is fairly clear that tying the
> "binary" attribute to "git grep" to allow us to override the built-in
> buffer_is_binary() call you see in grep.c gives the most intuitive result,
> without forcing the user to do anything more than they are already doing.

This is not a complaint about the core of your point, but rather an
aside that should be considered: how many people are really using the
binary macro attribute? Personally, I never use it, because when I mark
something with a "diff" attribute, it is because I am telling git about
a specific diff driver (usually textconv). Otherwise I don't bother
setting attributes at all, because git's binary detection tends to be
good. This leaves me without setting "-text", of course, but I don't
generally care because I don't do CRLF conversion at all.

So I think it is worth considering not just people setting "binary", but
how users of just "diff" (both "-diff" and "diff=foo") will want things
to work.

> Suppose that this binary blob firmware came with an API manual formatted
> in PDF, xyzzy.pdf, also supplied by the vendor. It is also kept in the
> repository, but again, running textual diff between updated versions of
> the PDF documentation would not be very useful. I however may have a
> textconv filter defined for it to grab the text by running pdf2ascii.
>
> Now if my "git show --textconv xyzzy.pdf" has an output line that says a
> string "XYZZY API 2.0" was added to the current version, wouldn't it be
> natural for me to expect that "git grep --textconv 'XYZZY API' xyzzy.pdf"
> to find it [*4*]?

This is an interesting concept. As a user of textconv, I already have
some specialized grep tools for matching inside binary files (e.g., I
have a tool that greps within exif tags of images). But being able to do
so with "git grep", and even at an arbitrary revision, is kind of neat.

I would worry about turning it on by default, since the results could be
misleading. In particular, your pattern "foo" might be in the binary
file but not in the textconv'd version, leading you to think you had
found all instances of "foo" but had not (or much more subtle, things
like line breaks really matter during the conversion if you are going to
be using "grep -C").

Making it available by "--textconv" seems reasonable to me, though. The
only inconsistency is that it's on by default for "git show", but would
not be for "git grep".

Perhaps I am being overly paranoid on the "misleading" bit above. It
seems to me that grep has the room to be a lot more subtle, because an
omission from the output is considered "did not match". But you could
construct equally weird scenarios for "git show" (e.g., you changed
"foo" to "bar" but that part did not appear in the textconv portion.
Which is really a quality-of-implementation issue for your textconv
filter).

> As an added bonus, if you truly want to omit _any_ hit from "git grep" for
> your minified.js files, you can easily do so. Just define a textconv
> filter that yields an empty string for them, and "grep --textconv" won't
> produce any matches, not even "Binary file ... matches".

Clever. But then you will never ever see a diff for that file, either,
because we will consider all changes to be empty (actually, I didn't
check, but you may get the diff header without any content, similar to
the stat-dirty entries).

(admittedly I have no clue if gitattributes actually supports anything like this)

I guess my point is as a user, I've gravitated to "gitattributes is about files in my repo, gitconfig is about Git's behavior" (though this is a grey area).

To partially answer my own question: one advantage of putting the filetype information in a config file is it allows system- and user-wide filetype settings. In my personal experience I've always handled that information on a per-repository basis, but that doesn't mean everyone would want to.

Re: [PATCH] Don't search files with an unset "grep" attribute

On 01/25/2012 10:46 PM, Jeff King wrote:
> But what I'm not sure I agree with is that the idea of "I don't want to
> include path X in my grep" maps to "just mark the file as binary".

Anybody who wants this policy can simply set

[attr]binary -diff -text -grep

If they want finer granularity, they can adjust the settings for
particular file types or for particular files.

> But should I mark everything in compat/nedmalloc as binary? I don't
> think so. I _do_ want to see changes in nedmalloc in "git log" or "git
> diff". They don't bother me because they're infrequent, and we still
> want to produce regular text patches for the list when they come up. But
> because nedmalloc contains a lot of lines of code (even though they
> don't change a lot), it happens to produce a lot of uninteresting
> matches when grepping.

I think decisions such as whether to include an imported module in "git
diff" output is a personal preference and should not be decided at the
level of the git project. The in-tree .gitattributes files should, by
and large, just *describe* the files and leave it to users to associate
policies with the tags (or at least make it possible for users to
override the policies) via .git/info/attributes. For example, the
repository could set an "external=nedmalloc" attribute on all files
under compat/nedmalloc, and users could themselves configure a macro
"[attr]external -diff -grep" (or maybe something like
"[attr]external=nedmalloc -diff -grep") if that is their preference.

> It would be nice to be able to treat them differently in the cases you
> wanted to, but not _have_ to do so. Attribute macros can almost
> implement this. You could add "-grep" to binary. But you can't (as far
> as I know) do "macro=foo" to handle arbitrary diff drivers. I suspect we
> could extend the rules to allow macros that take an argument and pass it
> to their constituent attributes.

Is it really common to want to use the same argument on multiple macros
without also wanting to set other things specifically? If not, then
there is not much reason to complicate macros with argument support.

for the main file types in my repository, and it is not very cumbersome.

"type-python" and "type=python" seem redundant but they are not.
"type-python" is needed so that it can be used as a macro.
"type=python" makes it easier to inquire about the type of a file using
something like "git check-attr type -- PATH" rather than having to
inquire about each possible type-* attribute. It might be nice to
support a slightly extended macro definition syntax like

For "binary", probably not much. But for textconv, it is all about the
split between attributes and config, as mentioned below:

> To partially answer my own question: one advantage of putting the
> filetype information in a config file is it allows system- and
> user-wide filetype settings. In my personal experience I've always
> handled that information on a per-repository basis, but that doesn't
> mean everyone would want to.

Right. Setting things system-wide instead of per-repo is one advantage.
But more important is that attributes are not per-repo, but rather
"per-project". They get committed, and everybody who works on the
project shares them.

In your example, the gitattributes get committed, and the project is
mandating "you _will_ use pdf2text to view diffs of these files". But
that may not be appropriate for everybody who clones. Somebody may have
a different pdf-to-text converter. Somebody may simply have pdf2txt at a
different path, or need different options. Or somebody may want to skip
it altogether and use an external diff command, or even just see the
files as binary.

By splitting the information across the two files, the project gets to
say "this file is of type pdf", and then each user gets to decide "how
do I want to diff pdf files?"

Re: [PATCH] Don't search files with an unset "grep" attribute

> I think decisions such as whether to include an imported module in "git
> diff" output is a personal preference and should not be decided at the
> level of the git project.

You're right. I thought of it as an annotation that the project could
mark via .gitattributes, or the user could mark via .git/info/attributes.
But that is not following the right split of responsibility for
attributes and config. The attributes should annotate "this isn't really
part of the regular git code base" or "this is really part of the
nedmalloc codebase". And then the _config_ should say "when I am
grepping, I am not interested in nedmalloc". I.e.:

# mark a set of paths with an attribute
echo "compat/nedmalloc external" >>.gitattributes

# and then ignore that attribute for this grep
git grep --exclude-attr=external

# or for all greps
git config --add grep.exclude external

and git doesn't even have to care about what the attribute is called.
It's between the project and the user how they want to annotate their
files, and how they want to feed them to grep.

Or any other program, for that matter. I wonder if this could also be a
more powerful way of grouping files to be included or excluded from diff
pathspecs. Something like (and I'm just talking off the top of my head,
so there may be some syntactic conflicts here):

# and then consider the tagged files to be a group, and look only at
# that group
git log :attr:test-script

# ditto, but imagine we had the negative pathspecs Duy has proposed
git log :~attr:test-script

That seems kind of cool to me. But maybe it is getting into crazy
over-engineering. I like the idea that we don't need a new option to
grep or diff; rather it is simply a new syntax for mentioning paths.

> The in-tree .gitattributes files should, by and large, just *describe*
> the files and leave it to users to associate policies with the tags
> (or at least make it possible for users to override the policies) via
> .git/info/attributes. For example, the repository could set an
> "external=nedmalloc" attribute on all files under compat/nedmalloc,
> and users could themselves configure a macro "[attr]external -diff
> -grep" (or maybe something like "[attr]external=nedmalloc -diff
> -grep") if that is their preference.

So obviously I took what you were saying here and ran with it above. But
I do disagree with one thing here: the attributes should be giving some
tag to the paths, but the actual decision about whether to grep should
be part of the _config_. That's the usual split we have for all of the
other attributes, and I think it makes sense and has worked well.

> Is it really common to want to use the same argument on multiple macros
> without also wanting to set other things specifically? If not, then
> there is not much reason to complicate macros with argument support.

I dunno. I admit my attribute usage tends to just match by extension,
and I generally only have one or two such lines.

I think it's not a big deal if you are making your own macros. I was
more concerned that people would want to use the "binary" macro to get
the "-grep" automagic, but could not do so because they don't want
"-diff", but rather "diff=foo".

Anyway, after reading your response and thinking on it more, I think
"-grep" is totally the wrong way to go. If the files are marked binary,
then grep should be respecting "-diff" or the "diff.*.binary" config. If
we want to do more advanced exclusion, then the right place for that is
the config file (or the weird :attr pathspec thing I mentioned above).

> "type-python" and "type=python" seem redundant but they are not.
> "type-python" is needed so that it can be used as a macro.
> "type=python" makes it easier to inquire about the type of a file using
> something like "git check-attr type -- PATH" rather than having to
> inquire about each possible type-* attribute. It might be nice to
> support a slightly extended macro definition syntax like
>
> [attr]type=python text diff=python check-ws
> *.py type=python
>
> [attr]type=makefile text diff check-ws -check-tab
> Makefile.* type=makefile
>
> (i.e., macros that are only triggered for particular values of an
> attribute).

I don't think there's any semantic reason why that is not workable. It's
simply not syntactically allowed at this point.

Re: [PATCH] Don't search files with an unset "grep" attribute

> On Mon, Jan 23, 2012 at 10:59:45PM -0800, Junio C Hamano wrote:
>
>> Conrad Irwin <[hidden email]> writes:
>> > I used to use this approach, hooking into the "diff" attribute directly to mark
>> > a file as binary, however that was clearly a hack.
>>
>> After thinking about this a bit more, I have to say I disagree that it is
>> a hack.
>
> I kind of agree.
>
> The biggest problem is that the name is wrong. The "diff.*.command"
> option really is about generating a diff between two blobs of a certain
> type. But "diff.*.textconv" and "diff.*.binary" are really just
> attributes of the file, and may or may not have to do with generating a
> diff. Ditto for diff.*.funcname, I think.
>
> You argue, and I agree, that if we are talking about attributes of the
> files and not diff-specific things, then other parts of git can and
> should make use of that information.
>
> So if this was all spelled:
>
> $ cat .gitattributes
> *.pdf filetype=pdf
> $ cat .git/config
> [filetype "pdf"]
> binary = true
> textconv = pdf2txt
>
> I think it would be a no-brainer that those type attributes should apply
> to "git grep".

I think this discussion has, instead of forking into two equally
interesting subthreads, veered to a more intellectually stimulating
tangent and we ended up losing focus.

Regardless of what to do with "I do not want to grep in these types of
files" and "I want textconv applied when grepping in these types", which
would be new attributes to implement two new features, I would like to see
us first concentrate on fixing the "binary" issue. When somebody tells us
"Your autodetection may screw it up, but this file is binary; just show
'Binary files differ.' when comparing." with "-diff" (or "binary"), we
should honor that when "git grep" decides if it should take the 'Binary
file matches' codepath. We currently do not, and it clearly is a bug.

This is especially made somewhat urgent because I do not want a half-baked
"two pathspecs" approach that only "git grep" knows about when we add the
support for "git grep --exclude-path=...".

We should have to teach the underlying machinery that matches pathspec
about negative pathspec entries only once. After we have done so, all the
callers, not just "git grep", should be able to take advantage of the
change by just learning to place negative pathspec entries in the "struct
pathspec" they pass to the machinery. Doing anything else will lead to
madness of adding ad-hoc "here we should further filter with the other
negative 'struct pathspec'" in each and every application.

But I suspect that it would not materialize anytime soon. And I also
suspect that the correct handling of 'Binary file matches', which is a
pure bugfix, should solve the original issue started these threads 90% in
practice.
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [hidden email]More majordomo info at http://vger.kernel.org/majordomo-info.html

> Regardless of what to do with "I do not want to grep in these types of
> files" and "I want textconv applied when grepping in these types", which
> would be new attributes to implement two new features, I would like to see
> us first concentrate on fixing the "binary" issue. When somebody tells us
> "Your autodetection may screw it up, but this file is binary; just show
> 'Binary files differ.' when comparing." with "-diff" (or "binary"), we
> should honor that when "git grep" decides if it should take the 'Binary
> file matches' codepath. We currently do not, and it clearly is a bug.

Right. It may have been lost in the verbosity of what I wrote in my
previous email, but I completely agree. With the caveat that one should
also respect "diff=foo" coupled with "diff.foo.binary = true" as making
something binary. But that is already handled transparently by the
userdiff.[ch] code (which seems like the obvious entry point for
grep to use for attribute lookup, and which we already use there for
funcname lookup).

The trivial-ish patch is below.

> We should have to teach the underlying machinery that matches pathspec
> about negative pathspec entries only once. After we have done so, all the
> callers, not just "git grep", should be able to take advantage of the
> change by just learning to place negative pathspec entries in the "struct
> pathspec" they pass to the machinery. Doing anything else will lead to
> madness of adding ad-hoc "here we should further filter with the other
> negative 'struct pathspec'" in each and every application.

Yes, I agree.

> But I suspect that it would not materialize anytime soon. And I also
> suspect that the correct handling of 'Binary file matches', which is a
> pure bugfix, should solve the original issue started these threads 90% in
> practice.

Also agree. Let's fix the bug and then give it some time to see whether
people really want more explicit exclusions.

Here's the bug-fix patch. Not quite ready for inclusion, as it obviously
needs tests and a commit message. Also, we can cache the result of the
userdiff lookup so the funcname code doesn't have to look it up again.

Re: [PATCH] Don't search files with an unset "grep" attribute

On Wed, Feb 01, 2012 at 03:20:05AM -0500, Jeff King wrote:

> Here's the bug-fix patch. Not quite ready for inclusion, as it obviously
> needs tests and a commit message. Also, we can cache the result of the
> userdiff lookup so the funcname code doesn't have to look it up again.

Actually, it's a little bit more complicated. I was looking at a
slightly old version of grep.c. Since 0579f91 (grep: enable threading
with -p and -W using lazy attribute lookup, 2011-12-12), the lookup
happens in lots of sub-functions, and locking is required.

So this is what the patch looks like with proper locking and caching of
the looked-up driver. It's quite messy because the cached driver pointer
has to get passed around quite a bit. And I'm not sure it buys that much
in practice. The cost of attribute lookup _is_ noticeable (which I'll
discuss below), but funcname lookup only happens when we get a grep hit.
So unless you are searching for something extremely common, you're only
going to do a lookup very occasionally (compared to the load of actually
searching through the files). So all of the messiness and caching may
not be worth the effort, as I wasn't able to measure a performance gain.

But there's more. Respecting binary attributes does mean looking up
attributes for _every_ file. And that has a noticeable impact. My
best-of-five for "git grep foo" on linux-2.6 went from 0.302s to 0.392s.
Yuck.

Part of the problem, I suspect, is that the attribute lookup code is
optimized for locality. We only unwind as much of the stack as we need,
so looking at "foo/bar/baz.c" after "foo/bar/bleep.c" is much cheaper
than looking at "some/other/directory.c". But with threaded grep, that
locality is likely lost, as we are mixing up attribute requests from
different threads.

Given that binary lookup means we need every file's gitattribute, it
might be better to look them up serially at the beginning of the
program, and then pass the resulting userdiff driver to grep_buffer
along with each path.

Re: [PATCH] Don't search files with an unset "grep" attribute

On Wed, Feb 1, 2012 at 1:10 AM, Jeff King <[hidden email]> wrote:
> On Wed, Feb 01, 2012 at 03:20:05AM -0500, Jeff King wrote:
>
> Actually, it's a little bit more complicated. I was looking at a
> slightly old version of grep.c. Since 0579f91 (grep: enable threading
> with -p and -W using lazy attribute lookup, 2011-12-12), the lookup
> happens in lots of sub-functions, and locking is required.

Heh, you just beat me to it.

> But there's more. Respecting binary attributes does mean looking up
> attributes for _every_ file. And that has a noticeable impact. My
> best-of-five for "git grep foo" on linux-2.6 went from 0.302s to 0.392s.
> Yuck.

The first time I introduced this behaviour[1], I made it conditional
on a preference — those who wanted "good" grep could set the
preference, while those who wanted "fast" grep could not. I think
that's not a good idea, though if the performance issues are
show-stoppers, I'd suggest the opposite preference (so speed-freaks
can disable the checks).

Tests from [1] included below in case they're still useful (they pass
with your change)

Re: [PATCH] Don't search files with an unset "grep" attribute

> Part of the problem, I suspect, is that the attribute lookup code is
> optimized for locality. We only unwind as much of the stack as we need,
> so looking at "foo/bar/baz.c" after "foo/bar/bleep.c" is much cheaper
> than looking at "some/other/directory.c". But with threaded grep, that
> locality is likely lost, as we are mixing up attribute requests from
> different threads.
>
> Given that binary lookup means we need every file's gitattribute, it
> might be better to look them up serially at the beginning of the
> program, and then pass the resulting userdiff driver to grep_buffer
> along with each path.

Yeah, that was my impression when the performance of threaded grep was
discussed, which was before this "let's honor binary attribute".
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [hidden email]More majordomo info at http://vger.kernel.org/majordomo-info.html

Re: [PATCH] Don't search files with an unset "grep" attribute

> > But there's more. Respecting binary attributes does mean looking up
> > attributes for _every_ file. And that has a noticeable impact. My
> > best-of-five for "git grep foo" on linux-2.6 went from 0.302s to 0.392s.
> > Yuck.
>
> The first time I introduced this behaviour[1], I made it conditional
> on a preference — those who wanted "good" grep could set the
> preference, while those who wanted "fast" grep could not. I think
> that's not a good idea, though if the performance issues are
> show-stoppers, I'd suggest the opposite preference (so speed-freaks
> can disable the checks).

I've been able to get somewhat better performance by hoisting the
attribute lookup into the parent thread. That means it happens in order
(which lets the attr code's stack optimizations work), and there's no
lock contention.

Re: [PATCH] Don't search files with an unset "grep" attribute

On Wed, Feb 01, 2012 at 05:14:37PM -0500, Jeff King wrote:

> > The first time I introduced this behaviour[1], I made it conditional
> > on a preference — those who wanted "good" grep could set the
> > preference, while those who wanted "fast" grep could not. I think
> > that's not a good idea, though if the performance issues are
> > show-stoppers, I'd suggest the opposite preference (so speed-freaks
> > can disable the checks).
>
> I've been able to get somewhat better performance by hoisting the
> attribute lookup into the parent thread. That means it happens in order
> (which lets the attr code's stack optimizations work), and there's no
> lock contention.
>
> I'll post finished patches with numbers in a few minutes.

OK, here they are. After playing with some options, I'm satisfied this
is a sane way to do it. I don't think it's worth having a config option.
There is a measurable slowdown, but it's simply not that big.

2. When "-I" is given, we can actually check attributes _before_
loading the file or blob into memory. This can help with very large
binaries.

3. When "-I" is given but we have no attribute, we can stream the
beginning of the file or blob to check for binary-ness, and then
avoid loading the whole thing if it turns out to be binary.

I think (1) and (2) should be easy. Doing (3) is a little messier,
because binary detection happens inside grep_buffer, but we can hoist it
out. However, for large files, it might be nice to have a streaming grep
interface anyway, and (3) could be part of that.