Description

Creating a new task to follow up on today's meeting. We decided that the DBAs would look into creating materialized views to help both the Analytics and Cloud use cases. I suggest we edit this task to keep a list of requirements that the DBAs can work with.

Analytics Requirements

Faster access to the corresponding comment_text for ipblocks, logging, and revision

Our access pattern for importing is always: select [subset of fields] from [subset of tables] where timestamp < 'The first day of the current month'.

So, for example, if it's November 2nd, and the materialized views have updated so that they include at least all transactions through November 1st 00:00, that would be great for us. If this would be available but much later, like November 5th or so, then we'd have to do some tests to see how fast we can import from the new infrastructure.
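To make the cutoff concrete, here is a hedged Python sketch (table and field names are illustrative; the real jobs are sqoop templates rather than this helper) of the "timestamp < first day of the current month" condition:

```python
from datetime import date

def month_start_cutoff(today):
    """Return the first day of the current month in MediaWiki's
    binary(14) timestamp format, used as the import upper bound."""
    return today.replace(day=1).strftime("%Y%m%d%H%M%S")

def import_query(table, fields, ts_column, today):
    # Illustrative template only: the real sqoop jobs substitute
    # $CONDITIONS and a split_by column, as mentioned later in this task.
    return ("SELECT {} FROM {} WHERE {} < '{}'"
            .format(", ".join(fields), table, ts_column,
                    month_start_cutoff(today)))

# If it's November 2nd, everything up to November 1st 00:00 is imported:
q = import_query("revision", ["rev_id", "rev_timestamp"],
                 "rev_timestamp", date(2018, 11, 2))
print(q)  # SELECT rev_id, rev_timestamp FROM revision WHERE rev_timestamp < '20181101000000'
```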

Cloud Requirements

Faster access to the actor and comment views for our users. Those two views depend on many subqueries to produce a single result, because the rules that determine visibility live in other tables.

Cloud users access the data in all kinds of ways. From my scan of the definitions, a materialized view that lags behind another table could briefly produce a broken reference to a comment or actor record, but it wouldn't result in an exposure, because the visibility rules are held in the external tables. The closer to real-time we can keep those tables, the better, but whatever refresh cadence we determine is possible needs to be communicated to cloud users so that they can adjust things if needed.

The content table also does a subquery to the slots table, but it is a single subquery that is probably not going to be as much of a problem. It's just worth noting in case that turns out to be incorrect.

This would need some extra tweaks: puppetization on a cron, a change of hosts, making it work in both datacenters, proper temporary storage, making it "better" in a code sense (remember this was a quick script coded in 5 minutes at the time), better authentication, using SSL, recording the generation time for each table, and, most importantly, avoiding metadata locking (!) and transitional states with no rows (maybe with some rename work). However, it is 90% there (even if it ends up rewritten from scratch).

The difficult part is the metadata locking, but maybe we can avoid DDLs completely?

More details on requirements for Analytics. The queries that we'll be running will be of the form:

select [subset of fields] from [one of the tables listed below] where [timestamp < the first day of this month (if there's a timestamp)]

We import the following tables:

archive

ipblocks

logging

page

pagelinks

redirect

revision

user

user_groups

And from any materialized views created to make it faster to grab actor/comment joins to archive/revision/logging.

Obviously SSDs would be better, but these should all be contiguous reads from disk and shouldn't require much memory, because no joins are done in real time. I think the biggest memory user will be refreshing the materialized views, which will probably join actor/comment to archive/revision/logging. From a speed point of view, here's what we need:

1st of the month: kick off any materialized view refreshes

when refreshes are done, import data

when import is done, kick off jobs which take at most 1 day

Before the performance problems, this process started on the 5th of the month and it generally finished by the 10th of the month. So if we had the same performance, we'd finish around the 5th of the month. If these boxes are half as fast at performing these tasks, we'd finish on the 10th. I'd say if they're close to half as fast or worse, we should get faster boxes. If not, we're good for now as a temporary solution.

@Marostegui while Dan answers your Q, I like his idea: Could the HDDs handle replication if we only replicated the tables Dan mentioned? I'd assume it wouldn't matter that much, as those are the most active tables and probably responsible for most of the replication load.

Happy to provide any other detail. (If you're looking at the templates, the max/min stuff is what gets substituted where you see $CONDITIONS, and split_by is how we tell sqoop which column we want to use.) Oh, and the convert(... using utf8) stuff is there because we export to Avro and it needs that to convert column types like varbinary into strings.

@Marostegui while Dan answers your Q, I like his idea: Could the HDDs handle replication if we only replicated the tables Dan mentioned? I'd assume it wouldn't matter that much, as those are the most active tables and probably responsible for most of the replication load.

There is also the fact that we would have another role to maintain, as the labsdb role would no longer be usable; we would need different configurations, filters, etc.
There is also the fact that we would need to put more/different filters in, and in case another table is needed in the future (for whatever reason) we would need to do an import on 900 wikis, so operationally that's a huge effort.
We have already experienced that on labs, when a table that was blacklisted is all of a sudden fine to replicate and we need to get it re-imported.

As discussed in the meeting, I think we need to try to solve this (or at least give it a proper try) from a MySQL/views point of view first.

Throwing hardware at it is OK as long as we are really sure it can fix the problem, which is not the case so far, at least not with the proposed hardware, and which implies a few big changes that could become tech debt. When we talked about maintaining 2 hosts (or whatever number), it was assumed (more or less) that they would be the same as the current labs hosts, which is no longer the case unfortunately :-(.

Let's please try to keep the tasks organized: discuss hardware on the other task and materialized-view-related things here, otherwise it is going to be hard to keep information on the same subject together.

@Milimetric could you be a little bit more specific, please? As I understand it, you currently run the queries described in T210693#4795470, and what you need from us is to create a few tables which would make those queries run faster, right?
What is not clear to me is whether we, as the Persistence Team, should figure out how to build those tables, or whether you can provide us with the SQL queries which will build them, so that our job is to integrate this process into our current system.
I have a proposal about the integration part in T210693#4796863: that would install a script on any chosen host and run it via cron. Currently it is aimed at labsdb1010, but it could easily be moved to any future host.

Thanks to @Banyek for the questions and a talk we just had over hangouts, we decided to go in two directions in parallel:

considering the complications that have been coming up here with the new cluster/replication/materialized views work, we think the solution the Analytics team is developing for the short-term might actually be better for the medium-term as well, until a better sanitize pipeline is in place. I'll describe this approach below, and we should consider it in our meeting tomorrow (cloud/dba/sre representatives are doing a hangout tomorrow).

I will send @Banyek the exact query we would have to run for the revision table (after the comment/actor refactor) so he can experiment with materialized views based on that query, and get a feeling for performance hit, disk space, etc.

I will follow up with both of these directions in comments here, so they can be linked independently.

To sum up discussion with @Banyek this morning:
From the Cloud perspective, this is a problem with the replica views that Analytics happens to be a sophisticated and well-connected reporter of. We'd like any solution to be usable by more than one user of the wiki replicas, since these problems will generate an increasing number of tickets over time from other users who notice them. As is, I can grab 10K page records in 0.04 sec; the same query on comment takes over 4 sec with no filtering. I can't help but imagine that more interesting queries will run into problems in other cases too. The type of materialization we are talking about, triggered on a schedule rather than by some kind of hook, would only fit limited uses, but folks might be happy to have the option (such as Analytics).

So I'm interested in how often a materialized view could be created for the comment and actor tables. It wouldn't fit everyone's use cases, but it is likely to fit some more than just Analytics.

So we are all on the same sheet of music, the query that creates the comment view in question (with a particular definer set as well) is:

Because re-creating these tables on a regular basis is likely to not be a small operation that might run into table locking problems, I am not all that hopeful that it will be able to be done more often than once a week. I can dream.

Anyway, we discussed running a test tomorrow to see what kind of resources this sort of operation takes up, doing a table create on one of the replica servers depooled in a wiki_p database, I presume. I'm not sure the tables are completely filled at this time or what kind of backfill is planned for them, so our mileage may vary from the test.

+1 to @Bstorm's framing of the problem. I was going to say the same thing, and withdraw my revision query from above from @Banyek's consideration. We can work around our performance problem with or without more hardware (more is better), but everyone wins if we get faster queries. What the DBAs should consider, with our help, is just improving the performance in general as @Bstorm outlined.

After looking more closely at our temporary solution, we think it's solid enough to work for us going forward, and it takes pressure off the DBAs to create any special cases for us. As @Bstorm points out, it seems more and more necessary to improve performance. But if others accept what I describe below, we don't have to be consulted on how exactly to improve it.

Now, the only thing we do with actor and comment is to join to the rows we grab from the cloud replicas. So we don't have to do any sanitizing on our side, the join will sanitize for us, because the keys have been nulled or zeroed out. If everyone agrees this is not a new implementation of the privacy pipeline, but just a nice side-effect, then we can go with this solution for the time being and revisit once we have a different sanitize pipeline.
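A toy sqlite3 sketch of why the join itself sanitizes (the schema is drastically simplified and the table contents are made up): the revision rows coming from the cloud views already have rev_actor zeroed where the deleted flags suppressed it, so joining them against the unsanitized actor table cannot surface the suppressed name:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE actor (actor_id INTEGER, actor_name TEXT);
INSERT INTO actor VALUES (1, 'PublicUser'), (2, 'SuppressedUser');

-- revision as imported from the cloud replica views: rev_actor is
-- already zeroed out where the deleted flags suppressed it
CREATE TABLE revision_sanitized (rev_id INTEGER, rev_actor INTEGER);
INSERT INTO revision_sanitized VALUES (10, 1), (11, 0);
""")

rows = db.execute("""
    SELECT r.rev_id, a.actor_name
    FROM revision_sanitized r
    LEFT JOIN actor a ON a.actor_id = r.rev_actor
    ORDER BY r.rev_id
""").fetchall()
print(rows)  # [(10, 'PublicUser'), (11, None)] -- the suppressed actor never appears
```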

From my chat with @Banyek, the approach now is to create tables populated with the following queries, and then Analytics will query those directly with their own queries, roughly as defined at T210693#4795364 and T210693#4797967?

So we are all on the same sheet of music, the query that creates the comment view in question (with a particular definer set as well) is:

Will this need to eventually happen on _all_ wikis? If that is the case, I don't think we'd have enough space to do so. But the test for enwiki will tell.

Because re-creating these tables on a regular basis is likely to not be a small operation that might run into table locking problems, I am not all that hopeful that it will be able to be done more often than once a week. I can dream.

Anyway, we discussed running a test tomorrow to see what kind of resources this sort of operation takes up, doing a table create on one of the replica servers depooled in a wiki_p database, I presume. I'm not sure the tables are completely filled at this time or what kind of backfill is planned for them, so our mileage may vary from the test.

There is also the fact that this operation will most likely generate metadata locking if the host isn't depooled. If Analytics gets their own host, that shouldn't be a problem but for Cloud Replicas that would be a no go, as we'd need a manual pooling/depooling of the host when we have to regenerate the materialized views.

After looking more closely at our temporary solution, we think it's solid enough to work for us going forward, and it takes pressure off the DBAs to create any special cases for us. As @Bstorm points out, it seems more and more necessary to improve performance. But if others accept what I describe below, we don't have to be consulted on how exactly to improve it.

Now, the only thing we do with actor and comment is to join to the rows we grab from the cloud replicas. So we don't have to do any sanitizing on our side, the join will sanitize for us, because the keys have been nulled or zeroed out. If everyone agrees this is not a new implementation of the privacy pipeline, but just a nice side-effect, then we can go with this solution for the time being and revisit once we have a different sanitize pipeline.

I am not really sure I get the idea.
Let me rephrase to see if I got it right:

You have actor and comment directly imported from production

You import revision, archive, and logging from the cloud replicas' already existing views, which have sanitized data

You join actor and comment production rows with the rows from revision, archive, and logging, which already have sanitized data as they come from the cloud views?

Questions:

Given that you are importing actor and comment directly from production and not from the Cloud Replicas, wouldn't you be bypassing any sanitization done at the view level for those two tables?

No, there is no real reason, but as this is a test only, I wanted to keep this 'clean', and enwiki feels more suitable for testing than enwiki_p.
Don't ask why, I can't answer it.
But in the end I can move the test table to enwiki_p if that is preferable.

Thanks for the clarification - I just wanted to know if there was some specific reason for it that I might have missed.
Once the test is done, make sure to either clean it up or move it to the views DB, as that is where the labsdbuser has access and to keep things consistent (views on one DB and underlying core tables on the other).

Basically, we are importing actor and comment from production replicas, and these include comment text that might have been redacted via a rev_deleted, log_deleted, etc. flag. But we import revision, archive, and logging from the cloud replicas.

Strictly speaking, if you aren't using the whole comment or actor table, whether it is sanitized always depends on whether the row in the external table is marked somehow deleted in the original table. So I imagine that if you only join to the comment or actor table without making those tables available in some way in the same "channel", it's functionally sanitized, yes (with the current definitions of the tables; should that change, this may not be true). That means that if you sqoop out the views for just revision, archive, and logging, I believe you could join them against the prod versions of comment and actor and get sanitized data, as long as you don't miss something that is applied at the sanitarium level for those tables (Is anything? @Marostegui).

Following up on a good problem that @Bstorm raised with my approach. I would love @Anomie to take a look at my comment as well and see if it makes sense. Ok, so to confirm @Marostegui's understanding, yes, we are importing archive, logging, and revision from the views in the cloud replicas, and actor, comment unsanitized from the production replicas. And the only way we use rows from actor and comment is to join them to rows from the three views, which are sanitized. And I believe what @Bstorm was pointing out was that if rev_deleted is set to sanitize rev_actor, that would be fine for the revision table. But if the archive table references the same actor_id through ar_actor, but the ar_deleted flag is not set, then that would not be sanitized. According to the actor view definition: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/profile/templates/labs/db/views/maintain-views.yaml#222, a particular actor_id is not present if it's sanitized by *any* log/rev/ar/etc_deleted flag.

But when building archive_compat, we join to the actor table, right? And only sanitize if ar_deleted & 4. That would mean:

if ar_actor and rev_actor point to the same actor_id, say 123

and the corresponding ar_deleted is 0 but rev_deleted is 4, then

in archive_compat we get the actor_name as ar_user_text

in revision or revision_compat that actor is sanitized

But as I see it, that's ok, because no id would show up in the revision table that would allow us to find that actor. Not the rev_user nor the rev_actor. I'm guessing this is also a case where the views for actor and comment are forced to do too much. If a particular comment or actor name should disappear from everywhere on the wiki, someone would have to set all the corresponding deleted flags to make it so. But the views have to sanitize using an OR instead of an AND just to be safe.

In any case, I think in our use case we're safe, as we're still sanitizing according to the deleted flags, which is what the _compat views are doing. @Bstorm, @Anomie, what do you think?

You seem to have the logic backwards. The actor view will expose the row if any log/rev/ar/etc_deleted flag doesn't "sanitize" it, or if it's referenced from the user or image tables which don't even have an xx_deleted flag.

I think you're just saying the contrapositive of what I said. I said "not present if sanitized" and you said "present if not sanitized". Is there some subtlety I'm missing?

You said "any reference is sanitized → not present". The contrapositive of that would be "present → not (any reference is sanitized)" which is equivalent to "present → all references are unsanitized". Note that production allows for there to be no references at all.

What I said was "there exists at least one reference AND any reference is unsanitized → present". The contrapositive of that would be "not present → not (there exists at least one reference AND any reference is unsanitized)" which is equivalent to "not present → there are no references OR all references are sanitized".
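The difference between the two statements can be checked mechanically (a toy model of the predicates only, not the real view SQL): model an actor's references as (table, sanitized) pairs and compare the two readings on a mixed case:

```python
def present_first_reading(refs):
    # "not present if it's sanitized by *any* deleted flag"
    return not any(sanitized for _, sanitized in refs)

def present_actual_rule(refs):
    # "expose the row if any reference does NOT sanitize it"
    return any(not sanitized for _, sanitized in refs)

# suppressed in revision (rev_deleted set), visible in archive
mixed = [("revision", True), ("archive", False)]
print(present_first_reading(mixed), present_actual_rule(mixed))  # False True

# they also differ when there are no references at all, which
# production allows, as noted above
print(present_first_reading([]), present_actual_rule([]))  # True False
```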

@Anomie I think you actually answered the question in that. As it currently stands, the logic there already effectively shows the row if it shows up in any of those other tables, not the other way around. I was confused at first there.

@Milimetric would like to try using the sanitized views of the revision, archive and logging tables to determine if he can expose data from the actor or comment tables via join (dodging the use of the really slow view on the replicas for that join). Since the condition is an OR, and the check is whether it is NOT deleted in an external table, I think you are effectively saying that he could do that "safely".

Right?

I'm not sure what exactly you're asking, so I'll give several answers.

You can safely make a view of actor that includes only the "OR EXISTS( ... )" clauses for revision, archive, and/or logging. It'll be a subset of the current actor view.

You can safely make a view of actor that checks against the sanitized revision, archive, logging, etc views (all of them or any subset). In that case you don't even need to check the xx_deleted fields, because those sanitized views already have xx_actor zeroed out.

You can safely make revision_compat, archive_compat, and logging_compat views that join with the unsanitized actor table and consider only the current row's rev_deleted/ar_deleted/log_deleted. That's exactly what the existing revision_compat, archive_compat, and logging_compat views already do, I believe.
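That third answer can be sketched in sqlite3 (toy schema; the 4 bit is the "user hidden" flag, as used for ar_deleted earlier in this task): join the unsanitized actor table, but null the name whenever this row's own rev_deleted hides it:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE actor (actor_id INTEGER, actor_name TEXT);
INSERT INTO actor VALUES (1, 'Alice'), (2, 'Bob');

CREATE TABLE revision (rev_id INTEGER, rev_actor INTEGER, rev_deleted INTEGER);
INSERT INTO revision VALUES (10, 1, 0), (11, 2, 4);  -- rev 11 hides its user

-- only the current row's rev_deleted is consulted, as in revision_compat
CREATE VIEW revision_compat AS
SELECT r.rev_id,
       CASE WHEN r.rev_deleted & 4 THEN NULL ELSE a.actor_name END AS rev_user_text
FROM revision r
LEFT JOIN actor a ON a.actor_id = r.rev_actor;
""")
rows = db.execute("SELECT * FROM revision_compat ORDER BY rev_id").fetchall()
print(rows)  # [(10, 'Alice'), (11, None)]
```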

If you want to fake up an actor_mat for a size estimate, I think something like this would do it. Obviously it won't work for a speed estimate though.

INSERT INTO actor_mat (actor_user, actor_name)
SELECT user_id, user_name FROM user;

-- All the rest have "xx_user = 0" because any that aren't 0 should have been done above
INSERT INTO actor_mat (actor_user, actor_name)
SELECT DISTINCT rev_user, rev_user_text FROM revision WHERE rev_user = 0
UNION SELECT DISTINCT ar_user, ar_user_text FROM archive WHERE ar_user = 0
UNION SELECT DISTINCT ipb_by, ipb_by_text FROM ipblocks WHERE ipb_by = 0
UNION SELECT DISTINCT img_user, img_user_text FROM image WHERE img_user = 0
UNION SELECT DISTINCT oi_user, oi_user_text FROM oldimage WHERE oi_user = 0
UNION SELECT DISTINCT fa_user, fa_user_text FROM filearchive WHERE fa_user = 0
UNION SELECT DISTINCT log_user, log_user_text FROM logging WHERE log_user = 0;

I am not sure it makes sense to continue with that approach.
We have more than 900 wikis; even if it takes 5GB on average per wiki for that materialized table... we don't have enough space to create all those tables.

Defining all the triggers might be a pain, but https://dev.mysql.com/doc/refman/5.5/en/replication-features-differing-tables.html makes it sound like we could add an extra column for a "show" flag to the comment and actor tables on the sanitarium without breaking replication. That flag would default to false, and then triggers (like in T209031#4740258) would update it based on changes to the referring tables. Then the views could show rows based on the flag column rather than having to do all those subqueries.

Or we could do more or less the same with added comment_show and actor_show tables, which would each have two columns (the actor_id/comment_id and the "show" flag). One join in the view would still be better than a bunch of subqueries.
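A rough sqlite3 sketch of the comment_show side-table idea (toy schema; the real thing would be MariaDB triggers on the sanitarium, would also need UPDATE/DELETE triggers, and would need OR-semantics across multiple referring rows rather than this last-write-wins REPLACE):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE comment (comment_id INTEGER PRIMARY KEY, comment_text TEXT);
CREATE TABLE comment_show (comment_id INTEGER PRIMARY KEY, show_flag INTEGER);
CREATE TABLE revision (rev_id INTEGER, rev_comment_id INTEGER, rev_deleted INTEGER);

-- 2 is used here as rev_deleted's "comment hidden" bit (illustrative)
CREATE TRIGGER rev_ins AFTER INSERT ON revision BEGIN
  INSERT OR REPLACE INTO comment_show
  VALUES (NEW.rev_comment_id, (NEW.rev_deleted & 2) = 0);
END;

-- the view does a single join on the flag instead of many subqueries
CREATE VIEW comment_view AS
SELECT c.* FROM comment c
JOIN comment_show s ON s.comment_id = c.comment_id AND s.show_flag;
""")
db.execute("INSERT INTO comment VALUES (1, 'visible text'), (2, 'hidden text')")
db.execute("INSERT INTO revision VALUES (10, 1, 0)")
db.execute("INSERT INTO revision VALUES (11, 2, 2)")
visible = db.execute("SELECT comment_text FROM comment_view").fetchall()
print(visible)  # [('visible text',)]
```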

On the issue of storage, the average per wiki would definitely not come out to 5GB. If you plot all wikis by number of revisions, the curve has a very long tail, with big wikis like enwiki, commons, and wikidata being outliers rather than the norm. I just did a basic count(*) in Hadoop, and enwiki has 801325542 out of a total of 3704431563 revisions across all wikis across all time. If the comment and actor tables scale roughly in line with this*, and enwiki has a 31GB materialized comment view as @Banyek mentioned, then my estimate for the overall size of the materialized comment view is around 150GB. Actor should be smaller because it benefits from sharing. I'm not sure if that's still too big, but I wanted to say I have good reason to think it's not as bad as the 4500GB estimate.

*comments should be distributed in the same way across wikis, because they mostly come out of revision and logging which are similar enough in size
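The arithmetic behind that estimate, using the numbers above:

```python
enwiki_revisions = 801_325_542
all_revisions = 3_704_431_563
enwiki_comment_mat_gb = 31  # size @Banyek measured for enwiki's materialized comment view

enwiki_share = enwiki_revisions / all_revisions
total_gb = enwiki_comment_mat_gb / enwiki_share
print(round(enwiki_share, 3), round(total_gb))  # 0.216 143, i.e. the "around 150GB" figure
```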

Thanks very much @Anomie, I understand my misunderstanding, and your third answer is what I was asking.

While we are still discussing whether this JOIN "feature" can do what we think, I would also like to bring to the table that we should think about a check to make sure the JOIN (or whatever is underlying it) isn't and won't be exposing private data, the same way we have check_private_data on our sanitarium and labs hosts to make sure we are not exposing anything considered private.
How could we do that?
Maybe some sort of check that runs a few days before the actual import/export, so if something is detected, the process won't start. That would make me sleep better.

I believe we settled this part, the join I proposed is safe, in that it exposes no more than the cloud views.

I would like also to bring to the table the fact that we should think about a check to make sure the JOIN (or whatever is underlying it) isn't or won't expose private data. Same way we have a check_private_data on our sanitarium and labs hosts, to make sure we are not exposing anything considered private.
How could we do that?

The check_private_data script checks three things:

whether private databases exist

whether private tables exist

whether columns that are filtered out by filtered_tables.txt have data in them

To do this properly, we would write a script that pulls in these definitions and lists of what's private and what's not, and builds Hive queries that check the resulting data. We would want to do this as late as possible in the publishing process, because once the dataset is published, it's out there. In Oozie, we could apply this check as a step after the JOINs, and if there's a problem, we can throw an error that will stop the job and alert us.
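A hedged sketch of what that check could look like (Python against sqlite3 for illustration only; the real version would build Hive queries, and the filtered-columns list below is made up, not the real filtered_tables.txt):

```python
import sqlite3

def check_private_columns(conn, filtered):
    """For every (table, column) the filters say must be redacted,
    count rows where the column still carries data; any hit is a
    violation that should stop the publishing job."""
    violations = []
    for table, column in filtered:
        n = conn.execute(
            "SELECT COUNT(*) FROM {} WHERE {} IS NOT NULL "
            "AND {} != 0 AND {} != ''".format(table, column, column, column)
        ).fetchone()[0]
        if n:
            violations.append((table, column, n))
    return violations

# toy data: one redacted column that still has a value in it
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE revision (rev_id INTEGER, secret_col TEXT)")
conn.execute("INSERT INTO revision VALUES (1, NULL), (2, 'leaked')")
bad = check_private_columns(conn, [("revision", "secret_col")])
print(bad)  # [('revision', 'secret_col', 1)] -> throw, stop the job, alert
```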

But I think this effort would be focusing our energies in the wrong place. So, right now, our data flow looks like this if you're thinking about private data:

1. private data leaks to cloud dbs

2. private data is copied to Hadoop

  a. datasets are crunched and published from Hadoop (the JOINs happen here)

3. private data is found

  a. found on cloud dbs by the check_private_data script, and fixed

  b. found on the wikis by editors or WMF staff, and fixed, which is reflected by replication to cloud dbs

4. on the next snapshot, all fixes are propagated to Hadoop

So what you're proposing is that we copy step 3a and run it in Hadoop as part of 2a. We could do it as I mentioned above, but we would still leak the vast majority of private data, because of the snapshot-in-time nature of what we're doing here. So any problems found from the 2nd of the month to the last day of the month, or 93% of the time, would not be known to us when we build that month's snapshot. So we're just like dumps, we leak most of the private data that's fixed during the month. The main thing that affects what private data we leak is what point in time we cut the snapshot, weighed against how quickly or slowly private data is found and fixed via either 3a or 3b.

So, if we're serious about making a dataset that has the least amount of private data in it within the timeframe that we want to publish it, the way to go about it would be to graph incoming changes to log_deleted, rev_deleted, etc. flags on rows from last month, over time. If there's a big spike of them at the beginning of this month, say on the 3rd of the month, and we delay our snapshot until the 4th, then that would be much more efficient than duplicating the logic of check_private_data and running it when it doesn't have much hope of finding any issues.

What I've said so far kind of sidesteps the fact that check_private_data is more of a check to make sure triggers and MariaDB work the way they're supposed to. This is not really an issue in Hadoop; we have never seen Hive or sqoop fail in the sense that they bring in the wrong data in the wrong column or something like that. Our queries could be dynamically written to take into account filtered_tables, private_tables, and private_databases, and that would be just as good as duplicating check_private_data. But our queries are also very simple and easily checked against the views for even better assurances of privacy.

To me, the best place to put our effort is to recognize that sometimes we have serious privacy leaks that stay around in dumps or analytics datasets for at least a month if not forever. And to start to design and build the next sanitize pipeline that would allow us to propagate fixes in real-time or more regularly to these datasets.

Thanks for the detailed analysis.
My proposal was more along the lines of checking that the JOIN is always safe.
On the sanitarium + labs side we do have 4 kinds of "security" (note the " "!) layers:

Replication filters

Triggers

GRANTs

Views

And yet, we still have the check_private_data script just to make sure there is something that would alert us if some data, column, or database made it (somehow) through all these layers. Some sort of safety net.
While the JOIN might be safe, I still think we should have some sort of alert mechanism to let us know when the JOIN isn't safe, or has changed, or might be querying the wrong things.
How to do that? I don't know - but I think we need to spend a bit of time thinking about how we could do it.

To do this properly, we would write a script that pulls in these definitions and lists of what's private and what's not, and builds Hive queries that check the resulting data. We would want to do this as late as possible in the publishing process, because once the dataset is published, it's out there. In Oozie, we could apply this check as a step after the JOINs, and if there's a problem, we can throw an error that will stop the job and alert us.

How long do you think that would take to build?
I am really afraid of leaving all the safety to a JOIN based on views and not having another check that lives outside of the cloud replicas :-)

To duplicate something like check_private_data in Hadoop, I'd guess a day to write it and a couple of days to review and test it. So probably like a week to get everything deployed and integrated with the current job. We have to change some other jobs too to make them depend on this check.

Unlike the cloud replicas, we don't replicate and publish everything, we only pull a specific set of fields from a specific set of tables. So I don't see how something would sneak by, but if it makes you feel better, I'm happy to spend the week and get it done :)

I should still say that I think we would get much more out of spending a couple of days to understand the data better and then adding a small delay. I'm willing to bet that would sanitize many more private records than the check_private_data approach.

To duplicate something like check_private_data in Hadoop, I'd guess a day to write it and a couple of days to review and test it. So probably like a week to get everything deployed and integrated with the current job. We have to change some other jobs too to make them depend on this check.

Unlike the cloud replicas, we don't replicate and publish everything, we only pull a specific set of fields from a specific set of tables. So I don't see how something would sneak by, but if it makes you feel better, I'm happy to spend the week and get it done :)

Hehe thanks - it is not really my call - I just want to bring awareness that whilst we have checks on the cloud replicas, we won't have any on the other side of the equation.
How we do that is obviously up for discussion!

I should still say that I think we would get much more out of spending a couple of days to understand the data better and then adding a small delay. I'm willing to bet that would sanitize many more private records than the check_private_data approach.

Sure! You know the Hadoop environment better than me! :)
Again, my point is that we are relatively safe from the cloud replicas side with the checks we have, and I want to feel we are also covered from the other side.
This is not really my call - I am not the one who has to say we have to spend time on that, but I want to make sure this is discussed among everyone, and whether it is signed off to go "live" without any check or not.

On labsdb1010 I started to create materialized views from all the comment views, except enwiki - as we know how much time and space that demands. The process is running on the host in a screen named create_mat; it also creates a create_mat.log logfile recording the time it takes.

@Banyek are you asking whether the views should have indices? I'm not sure for the Cloud users' use case; you'd have to look at common or slow queries against the _compat views to decide. For the Analytics use case it shouldn't matter: we just get everything without joining, so an index wouldn't help much.

I'm not sure for the cloud users use case, you'd have to look at common or slow queries against the _compat views to decide.

I don't think there honestly is a Cloud-wide use case for these tables until we have a solution that keeps them in close sync with the live data. Trying to explain to our end users that there are parallel tables with different names that are up to a month behind the live data will be a customer relations headache.

I'm not sure for the cloud users use case, you'd have to look at common or slow queries against the _compat views to decide.

I don't think that there honestly is a Cloud wide use case for these tables until we have a solution that keeps them in close sync with the live data. Trying to explain to our end users that there are parallel tables with different names that are up to a month behind the live data will be a customer relations headache.

Yeah, then at least in this incarnation these views are not a good solution. Honestly this is a really unique problem. The use cases for this cluster are both OLTP and OLAP. The OLTP ones suffer because this replica is not tuned the same as the production one. And the OLAP ones suffer because the schema is made with OLTP in mind. And the scale of the data makes a single solution very unlikely. If we could do two things, we could have:

a near-real-time denormalized-on-replication schema to cover the OLAP use cases

a real-time replica that serves a "recent changes" version of all tables

Talking to consumers to even get the requirements for that kind of split seems impossible, but solving this problem with a single approach is also beginning to seem impossible. Maybe the "talk to people" route is long but at least it's relatively easy.

I'm not sure for the cloud users use case, you'd have to look at common or slow queries against the _compat views to decide.

I don't think that there honestly is a Cloud wide use case for these tables until we have a solution that keeps them in close sync with the live data. Trying to explain to our end users that there are parallel tables with different names that are up to a month behind the live data will be a customer relations headache.

Agreed - I don't think this solution is for every use case, and I'm not sure it is worth all the orchestration work (there is a lot) that would need to happen behind the scenes.