@Usmanmuhd I don't think you should be overly specific in the proposal. You can start by saying that the goal of the project is to improve the article recommendation pipeline. Also, don't be tricked by the number of tasks. Are you sure you can get those tasks done and pushed to production in 12 weeks? Just doing the coding part doesn't mean your changes will automatically be enabled in production. You'll learn all this as we go along.

Fri, Mar 15

@leila we did staging because we wanted to make sure that the back end can handle the load. Now that we know it can, we can safely use the intended sampling rates. I'm not sure of other reasons why staging is needed. Maybe @Miriam knows?

@Ottomata a heads up that we'll be collecting citation data starting March 20th; collection will last one month. The sampling rate is 100% for the CitationUsage schema and 33.3% for the CitationUsagePageLoad schema, as before (similar to the second round of data collection: T203253).

Hm, why do you need the artifact in analytics/refinery? Can you just use scap + git-fat to deploy research/article-recommender/deploy to e.g. stat1007 and put the zip file wherever it needs to go (HDFS?).

Thu, Mar 7

I was not aware of the DOI case. Thanks for bringing it up. I think in that case it makes sense to use the URL only and ignore the external flag. It will probably take a couple of weeks to get this adjustment made in code and shipped to production. If time is of concern, then let's derive whether a link is external or not during analysis. Please let me know what you prefer.

I was relying on a local virtual environment on stat1007. I have refactored the code into a package and uploaded it to PyPI so that I can make it a dependency of the Oozie script. This way I can have a simple entry point for Oozie that depends on this external package. If we want to add more recommendation types, or improve the article recommendation code, all I have to do is update the package and point Oozie at the new version.
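A thin entry point of that shape might look like the sketch below. All names here are illustrative (the real logic would live in the PyPI package, not inline), so treat this as a pattern, not the actual script:

```python
import argparse

def build_recommendations(source_lang, target_lang):
    # Stand-in for the real recommendation logic that would live in the
    # external package; here it just returns a placeholder so the wiring
    # is runnable on its own.
    return ["%s->%s:placeholder" % (source_lang, target_lang)]

def main(argv=None):
    # Oozie would invoke this script with language-pair arguments;
    # bumping the pinned package version is the only change needed
    # to ship improved recommendation code.
    parser = argparse.ArgumentParser(description="Oozie entry point (sketch)")
    parser.add_argument("--source", default="en")
    parser.add_argument("--target", default="fr")
    args = parser.parse_args(argv)
    return build_recommendations(args.source, args.target)

if __name__ == "__main__":
    print(main())
```

The point of the pattern is that the entry point stays trivial: versioning and testing happen in the package, and Oozie only ever sees a stable command-line interface.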

Wed, Mar 6

If so, perhaps instead of relying on markup for this notion of external-ness, we should consider calculating it based on each link's href attribute? @bmansurov, what do you think?

I think we should use both signals (i.e. the 'external' flag and the link URL) because other than the bug mentioned (T217567), the 'external' flag is pretty accurate. It's unfortunate that a related bug (T13477) has been open for many years and won't be fixed any time soon.
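Combining the two signals could look something like the sketch below. The function name, the default wiki host, and the treatment of the DOI case are all assumptions for illustration, not the deployed logic:

```python
from urllib.parse import urlparse

def is_external(href, external_flag, wiki_host="en.wikipedia.org"):
    """Treat a link as external if its href points off-wiki, falling back
    to the markup's 'external' flag for relative links. The URL check
    covers cases where the flag is wrong (e.g. the bug in T217567)."""
    host = urlparse(href).netloc
    if host:
        # Absolute and protocol-relative URLs carry a host; compare it
        # against the wiki's own host.
        return host != wiki_host
    # Relative links (e.g. /wiki/Foo) have no host; trust the flag.
    return bool(external_flag)
```

Under this scheme a DOI link like `https://doi.org/...` is classified as external from its URL alone, regardless of what the flag says, which matches the earlier suggestion to lean on the URL when the flag is unreliable.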

@Nuria having thought about your comment, I think I misunderstood you the first time. I think you mean I should explore whether the way Discovery is doing this is applicable to the Research team's use case. Please let me know if I've got it wrong this time too.

@Usmanmuhd on IRC you mentioned that you'd submit a patch for this task. If you started working on the task, feel free to assign it to yourself. Also feel free to ask questions here, on IRC, or via email.

@Usmanmuhd, welcome! The link's been fixed. There are no micro tasks for this project, unless you want to split up the work into meaningful parts and work on them separately. But I think the project is self-contained.

Wed, Feb 27

Interestingly, opening the survey in a new tab did not trigger the QuickSurveysResponses schema, as far as I could tell from the client side (by watching the network tab in the web console). That's odd, though not a blocker: the QuickSurveyInitiation schema provides enough information in most cases, even if not all of it.

That's correct, we're talking about the reader demographics survey. Strangely, I don't see the error in the console and mw.msg('Reader-demographics-1-privacy') is returning something that looks like this: