This comment has been minimized.

Minor concern:
Since preprocessing is where we are pushing all of the data augmentation API, should data augmentation layers (proposed in https://github.com/keras-team/keras/pull/9056/files) go in keras/preprocessing, or, since they are layers, should they go in Keras itself?

We should define clear boundaries for each repo. Also, will other repos be allowed to add dependencies?

And most examples in keras/examples will rely on all 3 repos. Should we add a separate setup.py for this folder to avoid circular dependencies?

This comment has been minimized.

They would just depend on Keras. Preprocessing/etc should still be importable from Keras. This change would be a pure refactoring, where the code is moved to a different place but the APIs don't change.

For now I would suggest keras-contrib. But conceptually a layer would definitely belong in Keras, not keras-preprocessing, regardless of what it does. The question is thus whether we want data augmentation layers as part of the core Keras API.

This comment has been minimized.

The two new repositories will need "owners". An owner is someone who has write access and bears responsibility for reviewing PRs (and merging them if they think they should be merged) and, to a large extent, for answering issues. I will still give guidance on API choices, but that will be limited to user-facing API decisions.

@srjoglekar246 which repo are you interested in working on, keras-preprocessing or keras-applications?

@taehoonlee would you be interested in becoming owner of the keras-applications repo, since you have done a lot of work on applications so far?


This comment has been minimized.

@fchollet, I'd love to! I wonder whether you intend to separate the tests for the core and the applications (as described in "Faster CI runs, better test coverage"). My minor concern is that the application tests will not be triggered whenever the core is updated if the applications are unbundled.

This comment has been minimized.

My minor concern is that the application tests will not be triggered whenever the core is updated if the applications are unbundled.

Testing applications when changing layers would fall into "integration tests", not unit tests. Issues with the layers should be caught by the layers unit tests. We could add one application-related integration test to fix this.

This comment has been minimized.

For applications, this would be the other way around right? keras would be a dependency of applications.

The split modules need to be importable from Keras, so they would be a dependency of Keras. We're not changing any API.

That means that keras-applications will need to dynamically load Keras at runtime. This is required anyway since we'll want the same applications module to work for tf.keras too!

There is some "preprocessing" code in applications as well (imagenet_utils). Would that be moved to the applications repo or the preprocessing repo?

That's application-specific, so it would stay in the keras-applications repo.

What about the examples? Separate repo?

I think the examples are fine as they are, we don't have the same constraints there. They're already well-separated from the codebase itself (and the API), and they're not tested so they don't inflict additional load on CI. They're also not replicated in tf.keras so we don't need a new "single source of truth" mechanism for them -- they're already in one place.

This comment has been minimized.


This process can be managed fairly smoothly if you'd like to do it, but I'd like to provide some food for thought based on lessons I've taken from seeing this process unfold in several other widely used projects (boost, homebrew, and ROS are examples). Here are some questions I'd ask and answer:

What concrete benefits are there from a user-focused perspective?

Are these benefits ones that a significant proportion of users actually care about?

Do these benefits to users offset the cost and complexity of added install and versioning steps?

Is there a risk of some unforeseen problems + extra user issues?

How upset will users be about having to "fix" stuff when they update?

At this time, the effects of items like "easier to sync with tf.keras" seem murky with respect to users and might benefit from better definition and/or alternative solutions. What if CI problems were solved with backend infrastructure changes that would be invisible to users? Could "owners" simply be assigned keras subdirectories? Everything might be totally worthwhile in the end; I'm just hoping to give perspective on alternative options.

The project might also consider taking the following steps, if appropriate. I believe they turned out to be enormously helpful on other projects:

Establish a clear versioning + dependency updating process in advance

Responsibilities for between-repository breakage should be clearly defined

Updates to one repository will very frequently break others in tricky ways.

Avoid assuming "you should always track either master / the latest release" will be feasible for everyone.

The homebrew project split into a bunch of "modular" repositories and two years later merged back down to two repositories (the "brew" app and "recipe" installer scripts) to better manage the manifold dependency problems that emerged.

Make a clear policy on how breakage between versions will be handled (or not).

Think about how additional policy changes will be made if the process doesn't turn out as expected.

Provide clear messages and automatic detection / acquisition of the correct dependencies with a way to override it.

Some of the above can probably come from pip, so perhaps it isn't as big a deal in this case compared to others I'm familiar with.

Example issues that might come up include:

Import errors that are mysterious to users who aren't very experienced with Python.

Someone will inevitably want to use their preprocessing workflow on a past version with the current Keras version and vice versa, and then post about the API conflicts they encounter.

Hopefully this information is helpful with navigating the process for any approach you decide to take!

This comment has been minimized.

@ahundt thanks for the detailed thoughts. These are all very good points.

the effects of items like "easier to sync with tf.keras" seem murky with respect to users

The idea is that this is a refactoring change (refactoring to get less code duplication): it makes life easier for devs (which ultimately has indirect user benefits) while not affecting the experience at all for end users.

Specifically, we're trying to:

avoid code duplication across tf.keras and keras-team/keras (duplication ultimately could create issues for end users of either tf.keras or keras-team/keras, since the same versioning problems you mention still exist in this setup and are arguably worse)

make CI faster, which helps contributors

I agree that we need a clear plan to make sure that the user experience is not affected. I think automatic dependency management (pip) solves most issues.

Establish a clear versioning + dependency updating process in advance

We should push out new releases of preprocessing and applications every time we do a Keras release. Conversely, having the new modules do independent releases doesn't seem to be an issue, assuming that the new releases keep backwards compatibility (which they should).

Let's say current Keras is 2.1.8 and current preprocessing is 1.0.1. We want to update Keras. Then we:

release preprocessing 1.0.2 which lists keras>=2.1.8 as a dependency

(same for applications)

release keras 2.1.9 which lists preprocessing 1.0.2 as a dependency

This implies that the bleeding edge version of preprocessing and applications is always compatible both with the latest Keras release and with the bleeding edge of Keras.
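As a concrete sketch, the dependency pins for that release sequence could look like this in each package's setup metadata (the package names and the choice of exact pin vs. version floor are illustrative; the comment above only specifies the preprocessing version numbers):

```python
# Illustrative only -- not the actual setup.py contents.

# keras-preprocessing 1.0.2 declares a floor on the Keras release it targets:
preprocessing_install_requires = ["keras>=2.1.8"]

# keras 2.1.9 then lists the just-released split module as a dependency
# (an exact pin is shown here; a floor such as >=1.0.2 is the other option):
keras_install_requires = ["keras_preprocessing==1.0.2"]
```

The same pattern would repeat for keras-applications. Note that this creates a deliberate mutual pip dependency, which works because pip resolves both from already-published releases.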

Responsibilities for between-repository breakage should be clearly defined

Thankfully we're in a situation where this will not happen. If this were a risk, we would not do it. In our case preprocessing and applications are very well scoped and separate from core Keras. Keras doesn't depend on them at all, and they're just consumers of a stable subset of the Keras API (Sequence, and layers/models, respectively). Any change in Keras that broke the new modules would in itself be a bug, since Keras isn't supposed to make breaking changes to these APIs. In particular, the new modules do not and should not consume any private/unstable Keras API (which is not an issue at all given how little coupling there is between the modules).

What we should do is CI-test the new modules both against the last release of Keras, and against the bleeding edge version of Keras. That way we ensure that changes in bleeding-edge modules don't break any existing users, and that they are ready to be used with bleeding-edge Keras. This guarantees an additional layer of stability.

(The reverse is not a problem since the Keras logic does not depend on the new modules; the pip dependency exists only to keep the existing keras.preprocessing/etc. namespace.)

Avoid assuming "you should always track either master / the latest release" will be feasible for everyone.

I think this is solved by having the new modules be compatible with both the bleeding edge and the latest release.

Think about how additional policy changes will be made if the process doesn't turn out as expected.

If we have intractable issues, we can always revert to the current state of affairs at any time, since:

What we're doing, and the reverse op, would be 100% API-compatible with the past

Doing so would just be a matter of copy/pasting code, so it could be done quickly and without issues

Provide clear messages and automatic detection / acquisition of the correct dependencies with a way to override it.

This comment has been minimized.


This sounds like a pretty reasonable plan to me. I'll just mention a couple additional potential pain points to consider based on your comments:

In particular the new modules do not and should not consume any private/unstable Keras API (which is not an issue at all given how little coupling there is between the modules).

This would definitely mitigate many potential issues. I think to make this happen the following steps might be worthwhile:

Perhaps some currently private APIs might be worth making public in this case to avoid code duplication

Have Travis check for and prevent external calls to symbols with leading underscores.

Have something in the README on how to request that an internal API be made external; this may simply be one sentence added to the existing API change process.
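The Travis check for leading-underscore calls suggested above could start as a simple lint script. A minimal sketch (the regex and function name are hypothetical, not an existing tool):

```python
import pathlib
import re

# Flag any reference to a leading-underscore Keras symbol, e.g.
# keras.applications.imagenet_utils._obtain_input_shape(...).
PRIVATE_CALL = re.compile(r"\bkeras\.(?:\w+\.)*_\w+")

def find_private_api_calls(path):
    """Return (lineno, line) pairs where a private Keras symbol is referenced."""
    hits = []
    for lineno, line in enumerate(pathlib.Path(path).read_text().splitlines(), 1):
        if PRIVATE_CALL.search(line):
            hits.append((lineno, line.strip()))
    return hits
```

CI could then fail the build whenever `find_private_api_calls` returns a non-empty list for any source file in the split repo.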

Responsibilities for between-repository breakage should be clearly defined

Thankfully we're in a situation where this will not happen. If this was a risk we would not do it.

I'll mention a couple of specific examples of conflicts that showed up later in Travis. keras-contrib actually broke quite a few times over the first few months due to internal Keras changes for which internal calls were not easily avoidable without duplication. Examples include:

imagenet_utils was one major source when the _obtain_input_shape() API changed

Everything can be resolved if there are cycles available to fix it; I just wanted to give a couple of specific examples of the pain-point categories that will multiply as the number of repositories increases.

I think this is solved by having the new modules be compatible with both the bleeding edge and the latest release.

Someone, or many individuals, might deploy code in production and not be able to update for one valid reason or another. Periodic LTS releases are the typical way to meet the needs of groups like that, but they aren't the only way to go.

This comment has been minimized.

I've been delaying doing this because I wanted to think about it for a while and make sure this was the right thing to do. Thanks everyone for the various comments and feedback! Now, let's make a decision.

@Dref360 @taehoonlee: should we do this split of preprocessing and applications into separate repositories, and would you take ownership of the split repos (@Dref360 for preprocessing and @taehoonlee for applications)?

We'll make a decision ASAP based on your answers, and do the split next week if the decision is to go ahead.

This comment has been minimized.

We should definitely do the split for keras.applications, which would be the "official" keras-zoo.

For keras.preprocessing, I think your motivations are valid, and the community will be much easier to manage with it. I'm kinda worried about this:

That means that keras-applications will need to dynamically load Keras at runtime.

We should be very cautious with that.
A user who tries to do import keras.preprocessing and gets Cannot find keras with a traceback that goes through the keras-preprocessing repository will be confused.

Also, we would need CI for multiple Keras versions, and we would have to deprecate old versions over time.

So overall I would like the repo to move forward with the split, and we should follow all of @ahundt's advice. I think all of those are valid concerns.

On the ownership: I can take care of keras.preprocessing, but I would need some help with keras.preprocessing.sequence since it's not my main area of research. I am open to getting help from other contributors as needed. :)

This comment has been minimized.


keras-contrib currently has a specific, clear example (travis link) of the sort of breakage I mentioned might be a concern in unbundled modules. I believe master moved some files around and removed some legacy code, and now the keras-contrib build doesn't work.

I haven't figured out a fix yet, but I haven't spent much time on it either. This is just for awareness that this sort of thing would become much more common, and the person who makes the change may not realize it or have a chance to fix the issue downstream. This sort of thing also doesn't show up until Travis automatically re-runs the dependency, which in the case of keras-contrib can mean up to a 24-hour delay.

This comment has been minimized.

@ahundt, the first one arose due to a dependency on legacy code, and it seems to have already been resolved by you. The preprocessing and applications modules use public, stable APIs and do not depend on legacy code.

The second one is orthogonal to the split. It even fails to import keras (AttributeError: module 'pandas' has no attribute 'core'). Why don't you try pip install tensorflow==1.7, python: 3.6, or export LD_LIBRARY_PATH, based on the master CI config?

This comment has been minimized.


@taehoonlee yes, thanks. The purpose of my post was to provide a clear example so the tradeoffs are clear in a more practical capacity than vague what-ifs. This is an admittedly very solvable problem, but one that will need to be resolved around n times for n repositories.

This comment has been minimized.

We need to be able to set what Keras module to use when importing preprocessing (or applications) from inside keras (or from inside tf.keras), so that things like keras.backend.set_image_data_format() get reflected in keras.preprocessing.

This is doable via a method like preprocessing.set_keras_module(sys.modules[__name__])

This implies that we can, in this order: 1) import preprocessing without any reference to Keras, and 2) set the Keras module right after the import.

However, Iterator(Sequence) leverages a Keras symbol (Sequence) in its class definition, which we would need at import time (so before the Keras module is available).

A possible solution is to ship Sequence with preprocessing. @Dref360, what do you think?
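For concreteness, the injection mechanism under discussion might be sketched like this (set_keras_module / get_keras_module are hypothetical names for illustration, not a settled API):

```python
# Sketch of what keras_preprocessing/__init__.py could hold (hypothetical).
_KERAS_MODULE = None

def set_keras_module(module):
    """Called by keras (or tf.keras) right after importing this package,
    e.g. set_keras_module(sys.modules[__name__]) from keras/__init__.py."""
    global _KERAS_MODULE
    _KERAS_MODULE = module

def get_keras_module():
    """Used internally so that state such as the image data format is read
    from whichever Keras module the host framework injected."""
    if _KERAS_MODULE is None:
        raise ImportError("set_keras_module() must be called before use.")
    return _KERAS_MODULE
```

Because the injected module is looked up at call time, keras.backend.set_image_data_format() in the host framework would be reflected in subsequent preprocessing calls.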

This comment has been minimized.

Never mind, we can structure the code in a way such that we can import the submodule and call set_keras_module before we enter the files where we use Sequence. However, as a result, users won't be able to import standalone symbols from image, e.g. from keras_preprocessing.image import Iterator; instead you'd have to do from keras_preprocessing import image; image.Iterator or keras_preprocessing.image.Iterator.

I don't think it matters, because presumably the module would get used from Keras almost every time (from keras.preprocessing.image import Iterator would still work), and also because from keras_preprocessing import image is the best practice in terms of Python style.
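The import-ordering constraint can be simulated with stand-in modules (everything below is a toy illustration; fake_keras and fake_preprocessing are made-up names, not real packages):

```python
import sys
import types

# 1) The preprocessing package is importable with no reference to Keras:
# its top level holds only the injection hook, not the submodules.
pkg = types.ModuleType("fake_preprocessing")
pkg._keras = None
pkg.set_keras_module = lambda module: setattr(pkg, "_keras", module)
sys.modules["fake_preprocessing"] = pkg

# 2) The host framework injects itself right after that import...
fake_keras = types.ModuleType("fake_keras")
fake_keras.Sequence = type("Sequence", (), {})
pkg.set_keras_module(fake_keras)

# 3) ...and only then is the submodule created, so its class definitions
# can safely subclass the injected Sequence at definition time.
image = types.ModuleType("fake_preprocessing.image")
image.Iterator = type("Iterator", (pkg._keras.Sequence,), {})
pkg.image = image
sys.modules["fake_preprocessing.image"] = image
```

Reversing steps 2 and 3 would fail, which is exactly why the submodules cannot be imported standalone before a Keras module has been set.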

This comment has been minimized.

I've implemented this initial plan, but I think we're going to have to go with a different design to make it possible to use tensorflow (which will set the keras_preprocessing keras module to tf.keras) and keras in the same session without subtle problems.

This comment has been minimized.

I've tried a couple of different designs, but things are turning out to be incredibly complicated in practice when you start converting Keras to use keras_preprocessing, which would refer back to keras (in a way that would work coherently with tf.keras...).

I'm currently leaning towards a more verbose / advanced design that would hopefully be safe: