> fsnotify is a backend for filesystem notification. fsnotify does> not provide any userspace interface but does provide the basis> needed for other notification schemes such as dnotify. fsnotify> can be extended to be the backend for inotify or the upcoming> fanotify. fsnotify provides a mechanism for "groups" to register for> some set of filesystem events and to then deliver those events to> those groups for processing.> > fsnotify has a number of benefits, the first being actually shrinking the size> of an inode. Before fsnotify to support both dnotify and inotify an inode had> > unsigned long i_dnotify_mask; /* Directory notify events */> struct dnotify_struct *i_dnotify; /* for directory notifications */> struct list_head inotify_watches; /* watches on this inode */> struct mutex inotify_mutex; /* protects the watches list> > But with fsnotify this same functionallity (and more) is done with just> > __u32 i_fsnotify_mask; /* all events for this inode */> struct hlist_head i_fsnotify_mark_entries; /* marks on this inode */> > That's right, inotify, dnotify, and fanotify all in 64 bits. We used that> much space just in inotify_watches alone, before this patch set.> > fsnotify object lifetime and locking is MUCH better than what we have today.> inotify locking is incredibly complex. See 8f7b0ba1c8539 as an example of> what's been busted since inception. inotify needs to know internal semantics> of superblock destruction and unmounting to function. The inode pinning and> vfs contortions are horrible.> > no fsnotify implementers do allocation under locks. This means things like> f04b30de3 which (due to an overabundance of caution) changes GFP_KERNEL to> GFP_NOFS can be reverted. There are no longer any allocation rules when using> or implementing your own fsnotify listener.> > fsnotify paves the way for fanotify. people may not care for the original> companies that pushed for TALPA, but fanotify was designed with flexibility in> mind. A first group that wants fanotify like interfaces is the readahead> group. So they can be profiling at boot time in order to dynamicly tune> readahead to help with boot speed. I've got other ideas how to use fanotify> to speed up boot when dealing with encrypted mounts, but I'm not ready to say> it yet since I don't know if my idea will work.> > This patch series just builds fsnotify to the point that it can implement> dnotify and inotify_user. Patches exist and will be sent soon after> acceptance to finish the in kernel inotify conversion (audit) and implement> fanotify.> >> ...>> +void fsnotify(struct inode *to_tell, __u32 mask, void *data, int data_is)> +{> + struct fsnotify_group *group;> + struct fsnotify_event *event = NULL;> + int idx;> +> + if (list_empty(&fsnotify_groups))> + return;> +> + if (!(mask & fsnotify_mask))> + return;> +> + /*> + * SRCU!! the groups list is very very much read only and the path is

Why is this called "evict"? In Linux, the term "eviction" implies somesort of involuntary asynchronous reclaimation. But afaict (and lackingexplanatory documentation) this function seems to be a plain oldfreeing function. So why is it not called fsnotify_free_group()?

/* * OK, now we know that there's no other users *and* we hold mutex, * so no new references will appear */> + __fsnotify_evict_group(group);> +> + /* now it's off the list, so the only thing we might care about is> + * srcu acces.... */

"access"

> + mutex_unlock(&fsnotify_grp_mutex);> + synchronize_srcu(&fsnotify_grp_srcu);> +> + /* and now it is really dead. _Nothing_ could be seeing it */> + fsnotify_recalc_global_mask();> + fsnotify_destroy_group(group);> +}> +>> ...>> +/*> + * Either finds an existing group which matches the group_num, mask, and ops or> + * creates a new group and adds it to the global group list. In either case we> + * take a reference for the group returned.> + *> + * low use function, could be faster to check if the group is there before we do> + * the allocation and the initialization, but this is only called when notification> + * systems make changes, so why make it more complex?

> +};> +> +/*> + * A group is a "thing" that wants to receive notification about filesystem> + * events. The mask holds the subset of event types this group cares about.

It's unclear what the "event types" are. FS_* from above?

Perhaps things would be clearer if they were named FS_EVENT_*, or FSE_*?

> + * refcnt on a group is up to the implementor and at any moment if it goes 0> + * everything will be cleaned up.> + */> +struct fsnotify_group {> + struct list_head group_list; /* list of all groups on the system */> + __u32 mask; /* mask of events this group cares about */> + atomic_t refcnt; /* num of processes with a special file open */> + unsigned int group_num; /* the 'name' of the event */> +> + const struct fsnotify_ops *ops; /* how this group handles things */> +> + unsigned int evicted:1; /* has this group been evicted? */

If someone adds another bitfield here then they will share the sameword and will hence need locking. It'd be less risky to just make thisa plain old `unsigned'. Or `bool'.

> + /* groups can define private fields here */> + union {> + };> +};> +> +/*> + * all of the information about the original object we want to now send to> + * a group. If you want to carry more info from the accessing task to the> + * listener this structure is where you need to be adding fields.> + */> +struct fsnotify_event {> + spinlock_t lock; /* protection for the associated event_holder and private_list */> + struct inode *to_tell;

Does the existence of a `struct fsnotify_event' cause a reference to betaken on fsnotify_event.to_tell?

If so, that's useful information to add here.

Either way, a few words about the design of the lifetime managementwould be helpful.