Description

We changed CollecTor almost a year ago to provide descriptors via https rather than rsync. We even turned off rsync a few months later. But we failed to provide a means for metrics-lib based applications to fetch descriptors via https. We should add that now.

Yes, that's the most up-to-date commit. I'm planning to add another class, DescriptorWriter, that generalizes writing descriptors to a local directory structure, which would be necessary for Onionoo to fetch descriptors from CollecTor and cache them locally before processing them with DescriptorReader. But I'm not planning to change the DescriptorCollector code in the next few days except for incorporating feedback. Thanks!

Is there a task already for switching to java 7 and adding some logging framework?
These two comprise my "ceterum censeo ..." :-)
In general, could this change be also used to softly introduce some logging into metrics-lib?

Now the code comments/questions:
(I hope I didn't miss any new test cases ...)
It might be helpful to add some tests for DescriptorCollectorImpl that explicit test for certain behavior.
These would be a very good source of documentation: e.g. take the comment
inside "putTogetherHistory". A test could explicitly verify the functionality
described in the comments.

DescriptorCollector interface:
Why not turn the comments into javadoc? They are there already and very
readable. Just adding the tags for return values and parameters is not a big deal.

Concerning DescriptorCollectorImpl:
Let's try to avoid the TODOs here from the beginning ;-)
For the various constants (COLLECTOR_BASE_URL, CONNECT_TIMEOUT_MILLIS, READ_TIMEOUT_MILLIS)
I would suggest using a system property (defaulting to the current values) as for the class
names in DescriptorSourceFactory using

as property names. Then it will be configurable immediately.
The retrieval of the 'int' properties needs just one extra method that returns
either the converted property value or logs an error and returns the default.
The URL could also be verified, if there is time to implement it now.

The other TODO is the Apache's directory list parsing: the regex parsing
approach is straigthforward and ok, but I'd make it testable and definitly
log any problems. So a method 'addDirsFromListing(SortedMap<String, Long> list)'
with the code from lines 177-200 and the additional logging might be useful.

Well, and the test for parsing directory string needs to be added.
Tests for the property handling above would be helpful, too.

Before I respond, I must confess that I'm currently thinking about making a major change to DescriptorCollector: the way it works right now is that it fetches descriptors, parses them, hands them over to you, and you do with them whatever you want. But most applications would want to cache descriptors locally before processing them, and in order to do that, we'll need a DescriptorWriter.

I started writing such a class and ran into three problems:

We need to distinguish whether a ServerDescriptor was published by a relay or a bridge. I think we can do that by looking at router-signature vs. router-digest, and I even wrote code for that, but that needs testing. So, not a big deal.

In order to write a Descriptor to disk, we sometimes need additional information. For example, a Microdescriptor doesn't contain a publication time, so we'd have to provide that seperately if we want to write descriptors to separate directories per month. Also, BridgeNetworkStatus doesn't contain the fingerprint of the bridge authority. Solving these issues is possible, but not pretty.

Here's the party killer: if we want to use DescriptorCollector to mirror CollecTor's descriptor archive tarballs, we'd instead receive all descriptors contained in tarballs and would then have to write them to new tarballs. In that use case we don't want to parse descriptors, which would even slow us down a lot.

So, I wonder if we should change DescriptorCollector to simply synchronize files from the CollecTor website to a local directory without parsing descriptors at all. Then we can use DescriptorReader as usual to process them. The only use case that this wouldn't cover is when we want to process CollecTor's descriptors on-the-fly without having them touch the disk, but that doesn't seem important enough to force the other use cases into this model.

Thoughts?

Let me also reply inline.

Is there a task already for switching to java 7 and adding some logging framework?
These two comprise my "ceterum censeo ..." :-)
In general, could this change be also used to softly introduce some logging into metrics-lib?

Both are good ideas, but let's do them when all services are migrated away from the dying yatei host.

Now the code comments/questions:
(I hope I didn't miss any new test cases ...)
It might be helpful to add some tests for DescriptorCollectorImpl that explicit test for certain behavior.
These would be a very good source of documentation: e.g. take the comment
inside "putTogetherHistory". A test could explicitly verify the functionality
described in the comments.

Agreed. For now, I'd risk doing the testing by setting up the services locally with the new metrics-lib and seeing if they still work okay. But yes, we should add tests.

DescriptorCollector interface:
Why not turn the comments into javadoc? They are there already and very
readable. Just adding the tags for return values and parameters is not a big deal.

Sure, why not. I'll fix that. We might later want to go through the other classes and clean up docs there, too. But there's no harm in starting to do this with newly added code.

Concerning DescriptorCollectorImpl:
Let's try to avoid the TODOs here from the beginning ;-)
For the various constants (COLLECTOR_BASE_URL, CONNECT_TIMEOUT_MILLIS, READ_TIMEOUT_MILLIS)
I would suggest using a system property (defaulting to the current values) as for the class
names in DescriptorSourceFactory using

as property names. Then it will be configurable immediately.
The retrieval of the 'int' properties needs just one extra method that returns
either the converted property value or logs an error and returns the default.
The URL could also be verified, if there is time to implement it now.

The current way of configuring DescriptorCollector is by calling methods on it after it's created and before it starts collecting. I'd rather want to make things configurable that way. It's actually easy to do that, so we can change those TODOs into methods.

I'd also want to have a more comprehensive plan for using properties in metrics-lib. It's quite possible that using them is a better idea than the current way of configuring things. But I'd want to have a high-level overview of that first.

The other TODO is the Apache's directory list parsing: the regex parsing
approach is straigthforward and ok, but I'd make it testable and definitly
log any problems. So a method 'addDirsFromListing(SortedMap<String, Long> list)'
with the code from lines 177-200 and the additional logging might be useful.

Will look into that. Making the code more testable sounds great.

Well, and the test for parsing directory string needs to be added.

Agreed.

Tests for the property handling above would be helpful, too.

(See above.)

What about adding the @Override annotation where appropriate?

Sure, if it's missing that's an oversight. I'll add that.

So, I'm curious to hear what you think about my suggestion above. Sorry for letting you review the current code and then suggesting to change it, but I didn't envision to run into all those problems. Thanks again!

Just mirroring the directory and handling the data later does seem to be the better approach.

Thanks, for providing the ExoneraTor use-case!

Here some comments and questions:

In my opinion/experience it usually pays off very well to write tests immediately when introducing regex-matching. This would document precisely what type of directory is expected and help with furture additions and changes.

As far a I can tell, you only want the "DescriptorCollector" to run once per instance? Why not just offer an interface consisting of one possibly overloaded method maybe with an additional boolean for the extraneous-file-cleanup (directories could be handled as List<String> or as String array)? If the method signature is too long, one could introduce a simple configuration class "CollectorConfiguration" as parameter for "collectRemoteFiles".

Is the repeated call of "collectRemoteFiles" not a valid use case?

Should the file-cleanup not be performed after each collector run?

The return value of "parseDirectoryListing" is not used; and in case of a problem nothing is logged (or printed to System.err). Same with "fetchRemoteFile", here it might be even more valuable to know why the download failed.

If the above suggestion (2.) is not used, the code might be a little more readable having the following method

I changed the interface towards a single method collectRemoteFiles() with five parameters. I took out the timeouts entirely, because I don't consider them important at all; having them in the previous interface was okay, but this interface is better with as few parameters as possible. We could have used overloaded methods, but these five parameters all looked reasonable to have in the interface.

Just one more question:
When receiving something other than a 200 response code while fetching the directories or files shouldn't that be reported in some way? Things like redirects, bad requests, not authorized, or others might be important for the caller?

Good question. I just opened the more general ticket #16225 to come up with an answer. Let's leave this ticket open until we have one.

While going through all open metrics-lib tickets I found that keeping tickets open that have lots of comments doesn't really serve us well. The question above is still a good one, but it's unnecessary to read the bulk of the discussion above to find an answer to it. I copied over the question to #16225, so that we don't miss it, and now I'm finally closing this ticket as implemented.