This checker checks at the end of a constructor call whether all fields of the object were initialized.

Note that a significant portion of the code is for handling unions, for which there is very little support in the CSA. In most cases, all fields of a union is regarded as unknown. I checked these cases by regarding unknown fields as uninitialized.

The test files also contain checks for heap allocated objects and unions, but heap regions are mostly conjured in the CSA and unions have little to no support, so the tests for them and unions should work theoretically, but its not 100%. so the tests are not yet functional (used no-warning at places where a warning is expected).

I haven't marked @a.sidorin's comment as done, but it disappeared because I renamed the file, so here it is:

Could you please explain what is the logic of test naming?

To which I replied:

The test files follow the strict structure that for each test case we'll define a structure type and then call its constructor(s) in a function right after it (functions are used to avoid zero initializations).

To be honest, there is no real logic behind the naming of the functions, it is only to keep the ODR. I used numbers in an increasing order, however I later added there cases, so in between f23 and f24 I used f23p5 and so on.

I figured that the strict structure of the test files would avoid confusion. Do you find it distracting?

With regards to the function naming, the problem with incremental counts is that they get out of sync, like they did with fXpY and such, and if someone wants to keep the count incremental down the file, adding any new test will result in an unnecessarily large patch. Perhaps you could use void T_() for the test that calls T::T()?

Had a look at the actual code, came up with some comments, most are fairly minor, so good job! You did a good job explaining the overall idea, but a lot of specifics were difficult to understand, so i made a few comments on how to make the code a bit easier to read.

I know, and I tried to look for ways to split this checker into smaller logical chunks, but I didn't find a satisfying solution.

What i would have done:

A checker that straightforwardly iterates through immediate fields of the current object and reports uninitialized fields.

Add base classes support.

Heuristic for skipping sub-constructors could use a separate review.

Start skipping arrays and unions.

Add simple pointer dereference support - introduce chains of fields.

More heuristics - the one for symbolic regions, the one for system header fields, etc.

etc.

I'm asking for this because it's, like, actually difficult and painful to understand formal information in large chunks. One-solution-at-a-time would have been so much easier. It is really a bad idea to start by presenting a working solution; the great idea is to present a small non-working solution with a reasonable idea behind it, and then extend it "in place" as much as it seems necessary. It is very comfortable when parts of the patch with different "status" (main ideas, corner-case fixes, heuristics, refactoring) stay in separate patches. Alpha checkers are essentially branches: because our svn doesn't have branches, we have alpha checkers and off-by-default flags. Even before step 1, you should be totally fine with committing an empty checker with a single empty callback and a header comment - even on such "step 0." we would be able to discuss if your approach to the checker seems right or might have inevitable issues, and it could have saved a lot of time.

Essentially every part of the solution that you implement, if you think it might deserve a separate discussion, also deserves a separate patch. It is reasonable to split the work into logical chunks before you start it. It's very unlikely that you have a single idea that takes 500 non-trivial lines of code to express. Splitting up ideas so that they were easy to grasp is an invaluable effort. I had very positive experience with that recently, both as a coder and as a reviewer, so i encourage that a lot.

At a glance it looks like a great use case for an immutable list. It's easy to use the immutable list as a stack, and it's also O(1) to copy/move and take snapshots of it.

Note that you copy 10 pointers every time you copy or move a SmallVector. Small vectors don't make much sense as fields of actively used data structures - their main purpose is to live on the stack as local variables.

Do we really need the other field (IsDereferenced) to be mutable? Or maybe we can keep the whole object immutable? I.e., instead of dereference() we'll get a constructor that constructs a FieldChainInfo with the same chain but with the dereferenced flag set.

Please explain what these methods do in a more direct manner. Something like "This method returns true if an uninitialized field was found within the region. It should only be used with regions of union-type objects".

Type of an object pointed to by a void pointer is something that our path-sensitive engine is sometimes able to provide. It is not uncommon that a void pointer points to a concrete variable on the current path, and we may know it in the analyzer. I'm not sure this sort of check is necessary (i.e. require more information).

You may also want to have a look at getDynamicTypeInfo() which is sometimes capable of retrieving the dynamic type of the object pointed to by a pointer or a reference - i.e. even if it's different from the pointee type of the pointer.

Aha, ok, got it, so you're putting the warning at the end of the constructor and squeezing all uninitialized fields into a single warning by presenting them as notes.

Because for now notes aren't supported everywhere (and aren't always supported really well), i guess it'd be necessary to at least provide an option to make note-less reports before the checker is enabled by default everywhere. In this case notes contain critical pieces of information and as such cannot be really discarded.

I'm not sure that notes are providing a better user experience here than simply saying that "field x.y->z is never initialized". With notes, the user will have to scroll around (at least in scan-build) to find which field is uninitialized.

Notes will probably also not decrease the number of reports too much because in most cases there will be just one uninitialized field. Though i agree that when there is more than one report, the user will be likely to deal with all fields at once rather than separately.

So it's a bit wonky.

We might want to enable one mode or another in the checker depending on the analyzer output flag. Probably in the driver (RenderAnalyzerOptions()).

It'd be a good idea to land the checker into alpha first and fix this in a follow-up patch because this review is already overweight.

Please check the type explicitly, and then assert that the value is a Loc. There are a lot of nuances that you don't want to be dealing with here (eg. ObjCObjectPointerType, which can be present in C++ as well because Objective-C++ is a thing, is not a PointerType or a ReferenceType, but it is modeled with a Loc).

Are there many warnings that will be caused by this but won't be caused by conducting the check for the constructor-within-constructor that's currently disabled? Not sure - is it even syntactically possible to not initialize the base class?

Is this loop guaranteed to terminate? Is something like void *v; v = &v; possible?

I think this loop should unwrap the AST type on every iteration and dereference the pointer accordingly.

In general, i believe that every symbolic pointer dereference should be made with awareness of the type of the object we're trying to retrieve. Which is an optional argument for State->getSVal(R, ...), but it seems that it should be made mandatory because in a lot of cases omitting it causes problems.

I'd like to change the word "conjure" to something else because in most cases it won't be an actual SymbolConjured, but it'd be a SymbolRegionValue or SymbolDerived. Conjuring occurs in different circumstances.

Can we at least detect these situations and suppress the warning? It should be relatively easy for lambdas (captured variable mapping should be stored somewhere, as far as i remember), not sure how easy it'd be for double inheritance (there are fancy methods for scanning this stuff, but that's equivalent to actually implementing the correct warning message).

Weird warning messages are pretty scary to show to the users.

It'd be a good idea to land the checker into alpha first and fix this in a follow-up patch because this review is already overweight.

This doesn't check that the constructor on the parent stack frame is anyhow related to the current constructor, so it may introduce false negatives. For instance, a class may lazy-initialize a singleton but never store a reference to it within itself (because, well, it's a singleton, you can always obtain the reference to it). In this case we'll never find the bug in the constructor of the singleton.

I'm sorry that the patch became this long. It was an error on my part, and I completely get that the checker now takes an uncomfortably long time to read and understand. This is my first "bigger" project in the CSA, I learned a lot so far, and I'll definitely review how I should've handled developing and uploading it here better, so next time I will do my best to make my contributions a lot more reviewer-friendly.

edit: Not just review-wise, but development-wise. This could've easily turned out wrong as a whole, and if it was, it could've been avoided if I shared my thoughts here.

It's very unlikely that you have a single idea that takes 500 non-trivial lines of code to express.

That's true. Since this was the first time I wrote anything the StaticAnalyzer project, I often found myself almost completely rewriting everything. Looking back, I still could've uploaded the code in parts, so notes taken, will do better next time!

It'll be some time before I sort everything out you commented on, just wanted to leave this here.

[...]i guess it'd be necessary to at least provide an option to make note-less reports before the checker is enabled by default everywhere. In this case notes contain critical pieces of information and as such cannot be really discarded.

This can already be achieved with -analyzer-config notes-as-events=true. There no good reason however not to make an additional flag that would emit warnings instead of notes.

I'm not sure that notes are providing a better user experience here than simply saying that "field x.y->z is never initialized". With notes, the user will have to scroll around (at least in scan-build) to find which field is uninitialized.

This checker had a previous implementation that emitted a warning for each uninitialized field, instead of squeezing them in a single warning per constructor call. One of the major reasons behind rewriting it was that it emitted so many of them that it was difficult to differentiate which warning belonged to which constructor call. Also, in the case of the LLVM/Clang project, the warnings from this checker alone would be in the thousands.

Notes will probably also not decrease the number of reports too much because in most cases there will be just one uninitialized field.

While this is true to some extent, it's not uncommon that a single constructor call leaves 7 uninitialized -- in the LLVM/Clang project I found one that had over 30!
Since its practically impossible to determine whether this was a performance enhancement or just poor programming, the few cases where it is intentional, an object would emit many more warnings would make make majority of the findings (where an object truly had only 1 or 2 fields uninit) a lot harder to spot in my opinion.

In general, I think one warning per constructor call makes the best user experience, but a temporary solution should be implemented until there's better support for notes.

// no-crash is something I found in many other test files, in this case it symbolizes that the point of this test isn't to check whether the checker correctly finds or ignores an uninit field, but that it doesn't crash.

This can already be achieved with -analyzer-config notes-as-events=true.

Yep. But the resulting user experience seems pretty bad to me. It was a good quick proof-of-concept trick to test things, but the fact that notes aren't path pieces is way too apparent in case of this checker. So i guess this flag was a bad idea.

Also, in the case of the LLVM/Clang project, the warnings from this checker alone would be in the thousands.

This makes me a bit worried, i wonder what's the reason why does the checker shout so loudly on a codebase that doesn't seem to have actual uninitialized value bugs all over the place.

Are any of these duplicates found in the same constructor for the same field in different translation units? Suppose we discard the duplicates (if any) and warn exactly once (across the whole LLVM) for every uninitialized field (which is probably more than once per constructor). Then I wonder:

How many uninitialize fields would we be able to find this way?

How many of such fields were intentionally left uninitialized?

How many were found by deep inspection of the heap through field chains involving pointer dereference "links"?

(when i'm asking this sort of stuff, i don't mean you should look through thousands of positives, a uniform random sample of ~50 should be just fine)
(but we really do need this info before enabling stuff by default)

This can already be achieved with -analyzer-config notes-as-events=true.

Yep. But the resulting user experience seems pretty bad to me. It was a good quick proof-of-concept trick to test things, but the fact that notes aren't path pieces is way too apparent in case of this checker. So i guess this flag was a bad idea.

I totally agree, I meant that a note-less solution should only be a (arguably better) workaround just as notes-as-events=true.

Are any of these duplicates found in the same constructor for the same field in different translation units? Suppose we discard the duplicates (if any) and warn exactly once (across the whole LLVM) for every uninitialized field (which is probably more than once per constructor).

Sure, having the same field field reported from the same constructor call in different TUs happens quite a lot, but they can easily be uniqued.

In the checkers current state, meaning that one warning per ctor call, the LLVM/Clang project in pedantic mode resulted in 409 non-unique warnings, and using CodeChecker I found that 181 of these is unique.

How many uninitialize fields would we be able to find this way?

In its current state, in pedantic mode, after uniqueing ~581. Now that's quite a bit off from what I remembered "the warnings from this checker alone would be in the thousands", but having one warning per uninit field would still result in a ~320% increase in reports.

How many of such fields were intentionally left uninitialized?

(I checked around 95 reports a couple weeks back before uploading this diff)Almost all of them. Note that this is a very performance-critical project. From what I saw, many times several fields are left uninitialized, but these fields are inaccessible due to a kind field being asserted at their getter functions.
Also, I found cases where the constructor wasn't defaulted with = default;.

In fact, I struggled to find a case where a field was 100% left out by accident. Maybe in llvm/lib/IR/ConstantsContext.h, calling ConstantExprKeyType(ArrayRef<Constant *> Operands, const ConstantExpr *CE).

This is not the case however in other projects I checked. In Xerces for example every find was a true positive.

How many were found by deep inspection of the heap through field chains involving pointer dereference "links"?

This is somewhat harder to measure. The best I could come up with is uniqueing the fieldchains from the plist files with lexicographical sorting. Your question can be answered by measuring the length of the fieldchains. I found

3 fieldchains with the length of 5

18 fieldchains with the length of 4

62 fieldchains with the lenfth of 3

108 fieldchains with the length of 2

137 fieldchains with the length of 1

(that is a total of 328 unique fieldchains lexicographically)
Because heap objects are not modeled, only 3 of the 328 fieldchains contains pointer dereferences. However, reference objects are very common from what I saw, but sadly I can't give you an exact number on those.

I hope these answers were satisfying, but I'll follow up with more information if needed.

In this diff, I preferred placing TODOs in places where the logic of the checker would have changed too much. While I didn't have a strict rule for this in mind, I tried changing as little as possible in order to implement the fixes in smaller followup patches.

Now with that being said, I still did make numerous changes to address the inline comments, so here is a complete list of them:

I placed the checker back in alpha.cplusplus

While the main logic of the checker is unchanged, I did make some small additions: it turned out (thanks to @NoQ) that objects I wrongly referred to as fundamental were in fact of BuiltinType or EnumeralType. In the checker I defined these types as "primitive", as neither clang or the C++ standard defines a type called "primitive"

A new node is added to the directed tree: objects of MemberPointerType. I updated the documentation, and added a function to FindUninitializedFields but left the implementation to be done in a followup patch

String concatenations are all done with streams

FieldChainInfo now uses llvm::ImmutableList

FieldChainInfo is now completely immutable

FindUninitializedFields no longer checks lazily, and has only a single getUninitFields public non-special member function.

FindUninitializedFields now has a ProgramStateRef field instead of a bunch of managers (this added a dependency D46891)

Added asserts and llvm_unreachable at numerous locations

More comments for easier reading

For the test files:

Changed all non-member function names to eliminate numbering

Split cxx-uninitialized-object.cpp up, and placed pointer/reference tests in a different file

Added tests for

loc::ConcreteInt

Member pointers

BuiltinType and EnumeralType

Constructorcall within a constructor, without the two constructors being in relation, to highlight a false negative

C++11 member initializers.

I'll be the first to admit that I'm not 100% sure I implemented the change to llvm::ImmutableList perfectly, so I guess maybe it's worth taking a deeper look at (but it certainly works!). Shall I tie its lifetime to the checker object's lifetime?
Edit: I mean, the Factory's lifetime.
Of course, I rechecked the entire LLVM/Clang project and I'm happy to report that no crashes occured, and it emitted almost the same amount of warning (almost, because I used a newer version of clang).

I took a look at it, but I found that it'd change the implementation of isPointerOrReferenceUninit so much, I'd be a lot more comfortable if I could implement it in a smaller followup patch. I placed some TODOs in the code about this.

I think that would be a bad idea. For example, this checker shouts so loudly while checking the LLVM project, it would practically halt the analysis of the code, reducing the coverage, which means that checkers other then uninit value checkers would "suffer" from it.

However, I also think that having multiple uninit reports for the same object might be good, especially with this checker, as it would be very easy to see where the problem originated from.

Is this loop guaranteed to terminate? Is something like void *v; v = &v; possible?

I looked this up, and I am confident that it is not possible for a pointer to point to itself. Only a void** object may point to a void*. The loop terminates even if you do something evil like v = reinterpret_cast<void*>(&v); (I have tried this with ints).

I think this loop should unwrap the AST type on every iteration and dereference the pointer accordingly.

In general, i believe that every symbolic pointer dereference should be made with awareness of the type of the object we're trying to retrieve. Which is an optional argument for State->getSVal(R, ...), but it seems that it should be made mandatory because in a lot of cases omitting it causes problems.

I invested some serious effort into getDynamicType, but I think it'll require more time and some serious refactoring of this (isPointerOrReferenceUninit) function, so I'd post it separately.

Ok! Putting any remaining work into follow-up patches would be great indeed. Let's land this into alpha.cplusplus for now because i still haven't made up my mind for how to organize the proposed bugprone category (sorry!).

Well, i guess that's the reason to not use the checker on LLVM. Regardless of fatal/nonfatal warnings, enabling this checker on LLVM regularly would be a bad idea because it's unlikely that anybody will be able to fix all the false positives to make it usable. And for other projects that don't demonstrate many false positives, this shouldn't be a problem.

In order to indicate where the problem originates from, we have our bug reporter visitors that try their best to add such info directly to the report. In particular, @george.karpenkov's recent NoStoreFuncVisitor highlights functions in which a variable was not initialized but was probably expected to be. Not sure if it highlights constructors in its current shape, but that's definitely a better way to give this piece of information to the user because it doesn't make the user look for a different report to understand the current report.

Thanks for the detailed stats! Yeah, that's a lot of "immediately" uninitialized fields. My concerns are mostly clarified. Yeah i was curious about references as well, but it's clear that chains of length 1 contain no references.

You have a check for isCalledByConstructor() in order to skip base class constructors but then you check them manually. That's a lot of code, but it seems that the same result could have been achieved by simply skipping the descent into the base class.

the warning message after a call to A::A() would be "3 uninitialized fields[...]", and for B::B() inside As constructor would be "2 uninitialized fields[...]", so it wouldn't be filtered out.

In my opinion, if A contains a field of type B, it should be the responsibility of A's constructors to initialize those fields properly.

You have a check for isCalledByConstructor() in order to skip base class constructors but then you check them manually. That's a lot of code, but it seems that the same result could have been achieved by simply skipping the descent into the base class.

I also believe that a constructor "should be held responsible" for the initialization of the inherited data members. However, in the case of base classes, uniqueing isn't an issue, as in this case:

struct B {
int a;
B() {}
};
struct D : public B {
int b;
D() {}
};

a call to D::D() would not check the inherited member (a), if I didn't implement it explicitly. That also means that if isCalledByConstructor was refactored to not filter out base classes, we would get two warning each with a single note, reporting b and a respectively. (I haven't actually checked it, but its a reasonable assumption)

Actually, I'm not 100% sure which approach is better: a warning for each base, or one warning for the object, bases included. I personally think one warning per object results in the best user experience.

I like both ideas, but I think if we'd come to the conclusion that a warning per base should be the way to go, it should definitely be implemented in a followup patch, as isCalledByConstructor is already in line for some major refactoring.

Oh, right. Yea. Silly me. Well in any case, non-void pointers can't point to themselves, as I understand it. And since void pointers are skipped for now, I think (and have found after some serious testing) that loop will terminate just fine.

I think it's better to have a stronger guarantee that this loop terminates.

Agreed. I see now however why did you find parts of this function so fishy.

I decided to mark the inline comments in isPointerOrReferenceUninit regarding the dereferencing loop termination done for now. I left several TODO's in the function to be fixed later, with many of them depending on one another, so fixing them all or many at once would probably be the best solution.

I left two conversations "not done" (fatal/nonfatal errors and base classes), but if I may, do you have any other objections?

LLVM is a special project in the fact that almost every part of it is so performance critical, that leaving many fields uninit matters. However, I would believe that in most projects, only a smaller portion of the code would be like that.

Suppose that we have a project that also defines a set of ADTs, like an std::list-like container. If that container had a field that would be left uninit after a ctor call, analysis on every execution path would be halted which would use an object like that.

My point is, as long as there is no way to tell the analyzer (or the checker) to ignore certain constructor calls, I think it would be best not to generate a fatal error.

Regardless of fatal/nonfatal warnings, enabling this checker on LLVM regularly would be a bad idea because it's unlikely that anybody will be able to fix all the false positives to make it usable. And for other projects that don't demonstrate many false positives, this shouldn't be a problem.

I wouldn't necessarily call them false positives. This checker doesn't look for bugs, and all reports I checked were correct in the fact that those fields really were left uninit. They just don't cause any trouble (just yet!).

I think of this check as a tool to support a specific programming model where every field needs to be initialized by the constructor. This programming model might be followed by some parts of the projects while a 3rd party library in the same project or some other files might not follow this model. Right now there is no easy way to turn off some checks for a set of files, and there is no way to turn off a set of checks for some headers. For this reason, I think it is not a good idea to make these errors fatal, as 3rd party headers might reduce the coverage of the analysis on a project in a way that the user cannot control.

If we are afraid of having multiple reports from this check we could turn off the check for that particular path, for example, we could have a bool stored in the GDM for each path whether this check is already reported an error or not and we can check that before emitting warnings.

@Szelethus I personally really like this idea, and I think it would be a great checker.
Could you perform more evaluation on other projects?

Thanks for the encouraging words!

I have checked some other projects:

Xerces: I only found one uninit field (shockingly), and I found that it was probably unintentional.

grpc: This project depends on numerous third-party libraries, so it has a variety of coding and implementation styles. The checker emitted 5 warnings, and I think it's very likely that at least 3 of those were unintentional (4, if we count the missing =default;).

CppCheck: No warnings at all :)

The checker didn't crash during the analysis of any of the projects.

After some thinking, I decided to mark the comments on base classes done. I believe that the current implementation results in the best user experience, but even if we were come to the conclusion that it is not, we could resume the conversation on this when I submit a patch to fix related issues within isCalledByConstructor.

I still don't fully understand the reasoning behind who's responsible for initializing the field, i.e., the exact rule that you're enforcing seems to cause warnings to depend heavily on meta-information such as where does the analysis start, which is confusing. Because we're trying to enforce a coding guideline here, i believe that the guideline itself should be transparent and easy to explain. In particular, it should be obvious to the user which constructor is responsible for initializing a field, and by obvious i mean it should be obvious both from the guideline itself and from the analyzer's warning, and both of them should be the same "obvious".

@sylvestre.ledru Have you found any actual bugs using this checker?@Szelethus Interesting, it seems that the pointer itself is initialized, but not what it's pointing to.
I think we should just check the fields directly, and do not attempt to traverse the pointer hierarchy.

Well, that is intentional: not the pointer, but the pointee is uninitialized, as you can see from the note message. Now with that being said, I have had an other colleague of mine complain about a report, as he didn't see that the note message said "pointee" not "pointer", so maybe there's a point in trying to come up with a better message.