* Q: V2 format
@ 2018-07-11 20:01 ebiederm
  2018-07-11 21:18 ` Konstantin Ryabitsev
  2018-07-12 1:47 ` Eric Wong
  0 siblings, 2 replies; 21+ messages in thread
From: ebiederm @ 2018-07-11 20:01 UTC (permalink / raw)
To: Eric Wong; +Cc: meta
I have been digging through the code so I can understand the v2
format. I have some ideas on how things might be improved, and some
questions so that I understand.
V1 supported the concept of messages being added and deleted from
the git repository all while keeping a full history of everything that
went on. The V2 code appears to have the name 'm' for added and 'd' for
deleted, but the public-inbox-index code appears to expect deletes to
happen by way of an altered history that totally purges the commits,
and does not process the 'd' entries.
What is the thinking about deleted entries, and for v2 what is the
preferred way to delete mail from a public inbox git repository and why?
Size. Reading the history of the public inbox meta mailing list and
playing around I discovered that I can shave off about 100M of the V2
size of the public inbox git repository by pushing all of the
messages into a single commit. Not great for day to day operation,
but if rebases are part of the plan, and old archives part of the
challenge I see quite a lot of potential for old archives to be reduced
to a git repository with a single commit.
Names. Is there a good reason not to use message numbers as the names
in the git repositories? (Other than the cost to change the code?) That
would remove the need to treat the sqlite msgmap database as precious,
and it would make it easier to recover if an nntp server goes away. In
V2 format the git mailing list git repository is only about 2M larger if
each message has its msg number as its name. Plus the git log
is easier to read as messages are all + or -.
xapian. Can the Xapian database be made optional in V2? I absolutely
think a quick search for terms and other things is very valuable, so I
would never suggest giving up Xapian. On the other hand on my personal
laptop the xapian database for lkml takes ages and ages to build, and it
pushes the system into swap, which is all around unpleasant. That
seems to eat into the distributed nature of the goal of public inbox.
I have tried to see what could be done that might shrink the size of
the xapian database. The only thing I could think of is perhaps
sharding the xapian database by time/msgnum ranges. That would allow
the old xapian databases to be compacted and forgotten about, and I
think it would allow less wastage in the current xapian database as it
would be smaller, so wasting 50% space (or whatever the btrees waste)
would be less of an issue. And as smaller databases are faster I think
that would in general be a help.
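A minimal sketch of the range-sharding idea, assuming a fixed count of message numbers per shard. The names `SHARD_SPAN`, `shard_for`, and `closed_shards` are hypothetical and not part of public-inbox; the sketch only shows that shards below the newest one stop receiving writes, so they could be compacted once and left alone.

```python
# Hypothetical illustration: route each message number to a fixed-size
# shard so that older shards stop receiving writes and can be compacted.
SHARD_SPAN = 100_000  # assumed range width; not a public-inbox setting


def shard_for(msgnum: int, span: int = SHARD_SPAN) -> int:
    """Return the shard index that owns msgnum (message numbers start at 1)."""
    if msgnum < 1:
        raise ValueError("message numbers start at 1")
    return (msgnum - 1) // span


def closed_shards(highest_msgnum: int, span: int = SHARD_SPAN) -> range:
    """Shards below the one holding the newest message get no new writes."""
    return range(shard_for(highest_msgnum, span))
```

With the assumed span, message 250000 lands in shard 2, leaving shards 0 and 1 closed.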
Time permitting I am willing to do some of this work so that
public-inbox works well for me. I want to see what your vision is for
the code before I start anything.
Eric

* Re: Q: V2 format
  2018-07-11 20:01 Q: V2 format ebiederm
@ 2018-07-11 21:18 ` Konstantin Ryabitsev
  2018-07-11 21:41 ` ebiederm
  2018-07-12 1:47 ` Eric Wong
  1 sibling, 1 reply; 21+ messages in thread
From: Konstantin Ryabitsev @ 2018-07-11 21:18 UTC (permalink / raw)
To: Eric W. Biederman; +Cc: Eric Wong, meta
On Wed, Jul 11, 2018 at 03:01:53PM -0500, Eric W. Biederman wrote:
> Names. Is there a good reason not to use message numbers as the names
> in the git repositories? (Other than the cost to change the code?) That
> would remove the need to treat the sqlite msgmap database as precious,
> and it would make it easier to recover if an nntp server goes away. In
> V2 format the git mailing list git repository is only about 2M larger if
> each message has its msg number as its name. Plus the git log
> is easier to read as messages are all + or -.
As in, instead of changes happening to the same file "m", the message is
saved into a new file and the old file deleted in each commit?
-K

* Re: Q: V2 format
  2018-07-11 21:18 ` Konstantin Ryabitsev
@ 2018-07-11 21:41 ` ebiederm
  0 siblings, 0 replies; 21+ messages in thread
From: ebiederm @ 2018-07-11 21:41 UTC (permalink / raw)
To: Konstantin Ryabitsev; +Cc: Eric Wong, meta
Konstantin Ryabitsev <konstantin@linuxfoundation.org> writes:
> On Wed, Jul 11, 2018 at 03:01:53PM -0500, Eric W. Biederman wrote:
>> Names. Is there a good reason not to use message numbers as the names
>> in the git repositories? (Other than the cost to change the code?) That
>> would remove the need to treat the sqlite msgmap database as precious,
>> and it would make it easier to recover if an nntp server goes away. In
>> V2 format the git mailing list git repository is only about 2M larger if
>> each message has its msg number as its name. Plus the git log
>> is easier to read as messages are all + or -.
>
> As in, instead of changes happening to the same file "m", the message is
> saved into a new file and the old file deleted in each commit?
Yes.
I believe from a git object perspective it is exactly the same. 1 tree
object per commit with exactly one file in it. The only difference is that
the files have different names.
Eric

* Re: Q: V2 format
  2018-07-11 20:01 Q: V2 format ebiederm
  2018-07-11 21:18 ` Konstantin Ryabitsev
@ 2018-07-12 1:47 ` Eric Wong
  2018-07-12 13:58 ` ebiederm
  1 sibling, 1 reply; 21+ messages in thread
From: Eric Wong @ 2018-07-12 1:47 UTC (permalink / raw)
To: Eric W. Biederman; +Cc: meta
"Eric W. Biederman" <ebiederm@xmission.com> wrote:
> I have been digging through the code looking so I can understand the v2
> format and I have some ideas on how things might be improved, and some
> questions so that I understand.
Great to know you're interested! Fwiw, I've still been meaning
to turn my v2 docs into a POD manpage:
https://public-inbox.org/meta/20180419015813.GA20051@dcvr/
> V1 supported the concept of messages being added and deleted from
> the git repository all while keeping a full history of everything that
> went on. The V2 code appears to have the name 'm' for added and 'd' for
> deleted, but the public-inbox-index code appears to expect deletes to
> happen by way of an altered history that totally purges the commits,
> and does not process the 'd' entries.
"Purge" is a new concept for v2 and not even exposed (yet) in
tools. Normal operations to remove files using 'd' (via
-watch or -rm) don't rewrite old history so it won't disrupt
non-force fetches.
> What is the thinking about deleted entries, and for v2 what is the
> preferred way to delete mail from a public inbox git repository and why?
Definitely prefer the normal way with 'd' files to not break
people using non-force fetches. "Purge" is too disruptive
and reserved for extraordinary cases (e.g. legal reasons).
> Size. Reading the history of the public inbox meta mailing list and
> playing around I discovered that I can shave off about 100M of the V2
> size of the public inbox git repository by pushing all of the
> messages into a single commit. Not great for day to day operation,
> but if rebases are part of the plan, and old archives part of the
> challenge I see quite a lot of potential for old archives to be reduced
> to a git repository with a single commit.
Rebases/rewriting history is definitely not part of the plan and
a last resort.
> Names. Is there a good reason not to use message numbers as the names
> in the git repositories? (Other than the cost to change the code?) That
> would remove the need to treat the sqlite msgmap database as precious,
> and it would make it easier to recover if an nntp server goes away. In
> V2 format the git mailing list git repository is only about 2M larger if
> each message has its msg number as its name. Plus the git log
> is easier to read as messages are all + or -.
Big trees in git were a scalability problem in v1 because of the
long 2/38 names. With shorter names you propose (base-10 serial
number?), the scalability problem gets pushed off a bit, I suppose.
But not indefinitely; and later v2 partitions will suffer more
from longer names.
I also want to limit the use and exposure of serial numbers as
much as possible. It's unavoidable with the NNTP interface;
but reliance on serial numbers in public interfaces leads to
centralization.
The current v2 is also better for inode-starved users in case
somebody forgets to type "--mirror" or "--bare" with clone. For
the most part (unless purge is used), the SQLite database is
actually recoverable.
So no, I don't think having serial numbers stored in filenames
is the right thing.
> xapian. Can the Xapian database be made optional in V2?
Definitely in the TODO :)
> I absolutely
> think a quick search for terms and other things is very valuable, so I
> would never suggest giving up Xapian. On the other hand on my personal
> laptop the xapian database for lkml takes ages and ages to build, and it
> pushes the system into swap. Which is all around unpleasant. That
> seems to eat into the distributed nature of the goal of public inbox.
> I have tried to see what could be done that might shrink the size of
> the xapian database. The only thing I could think of is perhaps
> sharding the xapian database by time/msgnum ranges. That would allow
> the old xapian databases to be compacted and forgotten about, and I
> think it would allow less wastage in the current xapian database as it
> would be smaller, so wasting 50% space (or whatever the btrees waste)
> would be less of an issue. And as smaller databases are faster I think
> that would in general be a help.
One big killer for Xapian is position information required for
"quoted phrase searches". I seem to remember deleting the position.*
files was safe as it would only break phrase searches (but I
haven't tried it).
So there should be an option to toggle between the "index_text"
and "index_text_without_positions" routines in Xapian.
Given the way the indexing only works on the most recent data;
I think one could also write a script to delete old data/results
from Xapian without affecting current/future indexing.
That would pop back up if/when there's schema upgrades requiring
a rebuild, though...
I believe there should be 3 levels of v2 operation:
1) SQLite-only (NNTP and all the threading stuff works)
2) SQLite + Xapian w/o positions (good enough for most things)
3) SQLite + Xapian w/ positions (current, default)
2) seems like a reasonable trade-off for most sites; I'm not
sure how often phrase searching gets used.
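To illustrate the trade-off between levels 2 and 3, here is a toy inverted index in plain Python. It is not Xapian and does not model Xapian's on-disk format; `build_index`, `stored_entries`, and `phrase_match` are hypothetical names. It only shows that dropping positions shrinks what is stored while making adjacent-word (phrase) checks impossible.

```python
from collections import defaultdict


def build_index(docs, with_positions):
    """Toy inverted index: term -> {docid: list of positions, or None}."""
    index = defaultdict(dict)
    for docid, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            if with_positions:
                index[term].setdefault(docid, []).append(pos)
            else:
                index[term][docid] = None
    return index


def stored_entries(index):
    """Count postings plus stored positions, as a rough size proxy."""
    total = 0
    for postings in index.values():
        for positions in postings.values():
            total += 1 + (len(positions) if positions else 0)
    return total


def phrase_match(index, docid, first, second):
    """True if `second` directly follows `first`; needs position data."""
    a = index.get(first, {}).get(docid)
    b = index.get(second, {}).get(docid)
    if a is None or b is None:
        return False  # term absent, or index built without positions
    return any(p + 1 in b for p in a)
```

For two tiny documents, `"git log git log"` and `"log git"`, the positional index stores 10 entries against 4 without positions, and only the positional one can answer whether "git log" occurs as a phrase.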
> Time permitting I am willing to do some of this work so that
> public-inbox works well for me. I want to see what your vision is for
> the code before I start anything.
Thanks for running this by me first. I'm not convinced git layout
changes are warranted at this point for v2.
Making Xapian optional and configurable to use
index_text_without_positions is something I definitely want to
see happen, though.

* Re: Q: V2 format
  2018-07-12 1:47 ` Eric Wong
@ 2018-07-12 13:58 ` ebiederm
  2018-07-12 23:09 ` Eric Wong
  0 siblings, 1 reply; 21+ messages in thread
From: ebiederm @ 2018-07-12 13:58 UTC (permalink / raw)
To: Eric Wong; +Cc: meta
Eric Wong <e@80x24.org> writes:
> "Eric W. Biederman" <ebiederm@xmission.com> wrote:
>> I have been digging through the code looking so I can understand the v2
>> format and I have some ideas on how things might be improved, and some
>> questions so that I understand.
>
> Great to know you're interested! Fwiw, I've still been meaning
> to turn my v2 docs into a POD manpage:
>
> https://public-inbox.org/meta/20180419015813.GA20051@dcvr/
I have some personal mail archives that I need to do something better
with. My goal is for day-to-day operations (aka mail delivery and
archiving) to be able to run on a smallish 32bit machine.
But archives are not valuable unless you have a fast search capability
which makes all of the features of xapian very interesting.
I need to compare message id's to see if I have content missing from the
public linux-kernel archive. It is probably Konrad's cleanup of the
headers but my linux-kernel archive when imported into public-inbox is
slightly larger than Konrad's.
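A rough sketch of that Message-ID comparison, assuming each archive can be read as an iterable of raw RFC 822 message strings. The helper names `message_ids` and `missing_from` are hypothetical; only the standard library `email` parser is used.

```python
from email import message_from_string


def message_ids(raw_messages):
    """Collect Message-ID values (angle brackets stripped) from raw messages."""
    ids = set()
    for raw in raw_messages:
        mid = message_from_string(raw)["Message-ID"]
        if mid:
            ids.add(mid.strip().strip("<>"))
    return ids


def missing_from(mine, theirs):
    """Message-IDs present in `mine` but absent from `theirs`, sorted."""
    return sorted(message_ids(mine) - message_ids(theirs))
```

Comparing on Message-ID rather than on blob content sidesteps header rewrites, since rewritten headers change the blob but normally leave the Message-ID alone.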
I also like the idea of being able to read and archive public lists that
I care about with just a git fetch and local tools.
Public mailing lists and their archives are more important, but on my
radar is also IMAP/regular email support, with its little bit of extra
state.
>> V1 supported the concept of messages being added and deleted from
>> the git repository all while keeping a full history of everything that
>> went on. The V2 code appears to have the name 'm' for added and 'd' for
>> deleted, but the public-inbox-index code appears to expect deletes to
>> happen by way of an altered history that totally purges the commits,
>> and does not process the 'd' entries.
>
> "Purge" is a new concept for v2 and not even exposed (yet) in
> tools. Normal operations to remove files using 'd' (via
> -watch or -rm) don't rewrite old history so it won't disrupt
> non-force fetches.
This helps a lot in understanding the intent of the code. Konrad had
mentioned something about being able to rebase when I pointed out
the buggy git commits in linux-kernel.
>> What is the thinking about deleted entries, and for v2 what is the
>> preferred way to delete mail from a public inbox git repository and why?
>
> Definitely prefer the normal way with 'd' files to not break
> people using non-force fetches. "Purge" is too disruptive
> and reserved for extraordinary cases (e.g. legal reasons).
Then I am going to report a probable bug. In V2 in public-inbox-index
I cannot find a path from finding a 'd' file to a call to unindex. V1
unindexes deleted files. Rebased heads for purges call unindex. I
don't see that for ordinary d files though.
>> Size. Reading the history of the public inbox meta mailing list and
>> playing around I discovered that I can shave off about 100M of the V2
>> size of the public inbox git repository by pushing all of the
>> messages into a single commit. Not great for day to day operation,
>> but if rebases are part of the plan, and old archives part of the
>> challenge I see quite a lot of potential for old archives to be reduced
>> to a git repository with a single commit.
>
> Rebases/rewriting history is definitely not part of the plan and
> a last resort.
>
>> Names. Is there a good reason not to use message numbers as the names
>> in the git repositories? (Other than the cost to change the code?) That
>> would remove the need to treat the sqlite msgmap database as precious,
>> and it would make it easier to recover if an nntp server goes away. In
>> V2 format the git mailing list git repository is only about 2M larger if
>> each message has its msg number as its name. Plus the git log
>> is easier to read as messages are all + or -.
>
> Big trees in git were a scalability problem in v1 because of the
> long 2/38 names. With shorter names you propose (base-10 serial
> number?), the scalability problem gets pushed off a bit, I suppose.
> But not indefinitely; and later v2 partitions will suffer more
> from longer names.
Big trees were a scalability problem in git because they are quadratic.
Every commit mentioned every email. So a walk of the history would
have to visit every file on every commit. I expect those tree objects
in the history compress well with their parents but it doesn't simplify
the tree walker.
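The quadratic growth can be made concrete with a simplified count. This sketch assumes one commit per message and ignores the 2/38 directory fan-out, so the numbers are illustrative only; `tree_entries_visited` is a hypothetical helper.

```python
def tree_entries_visited(n_messages: int, one_file_per_commit: bool) -> int:
    """Count tree entries a naive full-history walk would touch.

    Assumes one commit per message. In the v1-style layout the tree at
    commit k lists all k messages so far; in the v2-style layout every
    commit's tree holds a single entry.
    """
    if one_file_per_commit:
        return n_messages
    # 1 + 2 + ... + n entries across all commits
    return n_messages * (n_messages + 1) // 2
```

Under these assumptions, 1000 messages cost 500500 entry visits in the everything-in-every-tree layout and 1000 with one file per commit.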
Would you like my test conversion script from V1 so you can take a look?
> I also want to limit the use and exposure of serial numbers as
> much as possible. It's unavoidable with the NNTP interface;
> but reliance on serial numbers in public interfaces leads to
> centralization.
I completely agree about public web interfaces. Message-ID is a much
better key to messages as it was generated by the message sender.
> The current v2 is also better for inode-starved users in case
> somebody forgets to type "--mirror" or "--bare" with clone. For
> the most part (unless purge is used), the SQLite database is
> actually recoverable.
Because of the parallelism in V2 I have noticed messages numbered
in an order that does not correspond to their commit order. So the
SQLite database isn't as recoverable as it might be, especially as the
parallelism introduces an element of non-determinism.
> So no, I don't think having serial numbers stored in filenames
> is the right thing.
I won't push it, but at the present time I respectfully disagree.
The big advantage I see with serial numbers (other than msgmap) is that
you can include multiple emails per commit (without going quadratic). I
am also looking at potentially storing the other email states that IMAP
and maildir mailboxes track. I can imagine that much more easily with
message numbers. Still I want to avoid something that makes git go
quadratic again.
>> xapian. Can the Xapian database be made optional in V2?
>
> Definitely in the TODO :)
>
>> I absolutely
>> think a quick search for terms and other things is very valuable, so I
>> would never suggest giving up Xapian. On the other hand on my personal
>> laptop the xapian database for lkml takes ages and ages to build, and it
>> pushes the system into swap. Which is all around unpleasant. That
>> seems to eat into the distributed nature of the goal of public inbox.
>> I have tried to see what could be done that might shrink the size of
>> the xapian database. The only thing I could think of is perhaps
>> sharding the xapian database by time/msgnum ranges. That would allow
>> the old xapian databases to be compacted and forgotten about, and I
>> think it would allow less wastage in the current xapian database as it
>> would be smaller, so wasting 50% space (or whatever the btrees waste)
>> would be less of an issue. And as smaller databases are faster I think
>> that would in general be a help.
>
> One big killer for Xapian is position information required for
> "quoted phrase searches". I seem to remember deleting the position.*
> files was safe as it would only break phrase searches (but I
> haven't tried it).
I have a very ugly patch that removes all of Xapian, so for day-to-day
nntp use it is certainly safe.
> So there should be an option to toggle between the "index_text"
> and "index_text_without_positions" routines in Xapian.
I might take a look at that. I just looked and the position database is
huge.
> Given the way the indexing only works on the most recent data;
> I think one could also write a script to delete old data/results
> from Xapian without affecting current/future indexing.
> That would pop back up if/when there's schema upgrades requiring
> a rebuild, though...
Good for testing. Not for long term as it is the actual indexing that
is painful.
> I believe there should be 3 levels of v2 operation:
>
> 1) SQLite-only (NNTP and all the threading stuff works)
> 2) SQLite + Xapian w/o positions (good enough for most things)
> 3) SQLite + Xapian w/ positions (current, default)
>
> 2) seems like a reasonable trade-off for most sites; I'm not
> sure how often phrase searching gets used.
I will take a look at that. That seems a straightforward place to
start that we can easily agree upon.
>> Time permitting I am willing to do some of this work so that
>> public-inbox works well for me. I want to see what your vision is for
>> the code before I start anything.
>
> Thanks for running this by me first. I'm not convinced git layout
> changes are warranted at this point for v2.
>
> Making Xapian optional and configurable to use
> index_text_without_positions is something I definitely want to
> see happen, though.
I will clean up my patches for that then.
Eric

* Re: Q: V2 format
  2018-07-12 13:58 ` ebiederm
@ 2018-07-12 23:09 ` Eric Wong
  2018-07-13 13:39 ` ebiederm
  0 siblings, 1 reply; 21+ messages in thread
From: Eric Wong @ 2018-07-12 23:09 UTC (permalink / raw)
To: Eric W. Biederman; +Cc: meta
"Eric W. Biederman" <ebiederm@xmission.com> wrote:
> Eric Wong <e@80x24.org> writes:
> > "Eric W. Biederman" <ebiederm@xmission.com> wrote:
> >> I have been digging through the code looking so I can understand the v2
> >> format and I have some ideas on how things might be improved, and some
> >> questions so that I understand.
> >
> > Great to know you're interested! Fwiw, I've still been meaning
> > to turn my v2 docs into a POD manpage:
> >
> > https://public-inbox.org/meta/20180419015813.GA20051@dcvr/
>
> I have some personal mail archives that I need to do something better
> with. My goal is for day-to-day operations (aka mail delivery and
> archiving) to be able to run on a smallish 32bit machine.
Great to hear your interest in that! public-inbox.org is still
32-bit on a $20/month VPS. Xapian really does better with an
SSD (freshly TRIM-ed), though; so my low-end netbook with HDD
struggles on big inboxes at the moment.
> But archives are not valuable unless you have a fast search capability
> which makes all of the features of xapian very interesting.
Agreed.
> I need to compare message id's to see if I have content missing from the
> public linux-kernel archive. It is probably Konrad's cleanup of the
> headers but my linux-kernel archive when imported into public-inbox is
> slightly larger than Konrad's.
Konrad == Konstantin? I haven't looked at what's in lore, yet,
but there were numerous header differences from the archives he
gave me for v2 development vs what I got from my own archives.
Off the top of my head:
* addresses in To:/Cc: lists rewritten for some old list addresses
* some addressee formatting/quoting changes as a result
* last (most recent) Received: header removed (but not actually
enough to anonymize the original recipient in most cases).
This affects sorting comparisons in search results
* reencoded some MIME parts to different encodings (to 8bit, I think)
Maybe some others.
> I also like the idea of being able to read and archive public lists that
> I care about with just a git fetch and local tools.
Yes. I still use "git log -p -B" etc. That said; I don't want
to give up too much to support that (the SQLite dependency doesn't
seem too expensive); and try to keep public-inbox easy-to-install.
Making Xapian optional will be a huge part of that.
> Public mailing lists and their archives are more important, but on my
> radar is also IMAP/regular email support, with its little bit of extra
> state.
Cool. I've been thinking about something for personal mail,
too. mairix is killing my beefier personal machine (because it
needs to rewrite the entire index every time) and
Maildirs+notmuch is a non-starter due to dentry cache overheads
and inode consumption.
> >> What is the thinking about deleted entries, and for v2 what is the
> >> preferred way to delete mail from a public inbox git repository and why?
> >
> > Definitely prefer the normal way with 'd' files to not break
> > people using non-force fetches. "Purge" is too disruptive
> > and reserved for extraordinary cases (e.g. legal reasons).
>
> Then I am going to report a probable bug. In V2 in public-inbox-index
> I cannot find a path from finding a 'd' file to a call to unindex. V1
> unindexes deleted files. Rebased heads for purges call unindex. I
> don't see that for ordinary d files though.
It shouldn't need to call unindex because they never get indexed
on rebuilds. V2 indexing walks history backwards (normal "git log"
behavior) so it remembers 'd' paths in the "$D" hash; and skips blobs
as it encounters them.
v1 needed to unindex because it used "git log --reverse" to walk
forward in history.
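A minimal sketch of that backward walk, under stated assumptions: history is modelled as an oldest-first list of `(name, blob_id)` pairs, and a 'd' commit is assumed to carry the same blob as the 'm' commit it cancels. The function `blobs_to_index` and the `deleted` set are hypothetical stand-ins (the set plays the role of the "$D" hash); this is not the actual public-inbox code.

```python
def blobs_to_index(history):
    """Walk commits newest-first and return blob ids that should be indexed.

    `history` is oldest-first: a list of (name, blob_id) pairs where name
    is 'm' (message added) or 'd' (message deleted).
    """
    deleted = set()  # blob ids seen in 'd' commits while walking backwards
    keep = []
    for name, blob_id in reversed(history):
        if name == "d":
            deleted.add(blob_id)
        elif name == "m":
            if blob_id in deleted:
                deleted.discard(blob_id)  # this deletion is accounted for
                continue                  # never indexed, so no unindex
            keep.append(blob_id)
    keep.reverse()  # restore oldest-first order
    return keep
```

Because the deletion is seen before the addition it cancels, the deleted blob is simply never indexed. A forward walk would index it first and then have to undo that work.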
> >> Size. Reading the history of the public inbox meta mailing list and
> >> playing around I discovered that I can shave off about 100M of the V2
> >> size of the public inbox git repository by pushing all of the
> >> messages into a single commit. Not great for day to day operation,
> >> but if rebases are part of the plan, and old archives part of the
> >> challenge I see quite a lot of potential for old archives to be reduced
> >> to a git repository with a single commit.
> >
> > Rebases/rewriting history is definitely not part of the plan and
> > a last resort.
> >
> >> Names. Is there a good reason not to use message numbers as the names
> >> in the git repositories? (Other than the cost to change the code?) That
> >> would remove the need to treat the sqlite msgmap database as precious,
> >> and it would make it easier to recover if an nntp server goes away. In
> >> V2 format the git mailing list git repository is only about 2M larger if
> >> each message has its msg number as its name. Plus the git log
> >> is easier to read as messages are all + or -.
> >
> > Big trees in git were a scalability problem in v1 because of the
> > long 2/38 names. With shorter names you propose (base-10 serial
> > number?), the scalability problem gets pushed off a bit, I suppose.
> > But not indefinitely; and later v2 partitions will suffer more
> > from longer names.
>
> Big trees were a scalability problem in git because they are quadratic.
> Every commit mentioned every email. So a walk of the history would
> have to visit every file on every commit. I expect those tree objects
> in the history compress well with their parents but it doesn't simplify
> the tree walker.
>
> Would you like my test conversion script from V1 so you can take a look?
Sure, but I can't guarantee I can find the time to spend on it;
but others might be interested.
> > The current v2 is also better for inode-starved users in case
> > somebody forgets to type "--mirror" or "--bare" with clone. For
> > the most part (unless purge is used), the SQLite database is
> > actually recoverable.
>
> Because of the parallelism in V2 I have noticed messages numbered
> in an order that does not correspond to their commit order. So the
> SQLite database isn't as recoverable as it might be, especially as the
> parallelism introduces an element of non-determinism.
*puzzled* were you able to reproduce that? The serial number
generation + threading happens in the main process and the
parallelism is limited to Xapian text indexing. -index
generates serial numbers by walking backwards with v2, and
complains on unexpected results.
As far as personal mail goes, I wouldn't want serial numbers at all
(more unnecessary state to keep track of).
> > So no, I don't think having serial numbers stored in filenames
> > is the right thing.
>
> I won't push it, but at the present time I respectfully disagree.
>
> The big advantage I see with serial numbers (other than msgmap) is that
> you can include multiple emails per commit (without going quadratic). I
> am also looking at potentially storing the other email states that IMAP
> and maildir mailboxes track. I can imagine that much more easily with
> message numbers. Still I want to avoid something that makes git go
> quadratic again.
You'd still want deeper trees. I'd still use hex, and maybe
truncate the blob hash to avoid having to keep track of any
serial number state. Maybe 2/2/4 naming is enough while using
git history to resolve collisions.
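For comparison, the two naming schemes can be sketched as path builders. This is an illustration under assumptions: `path_2_38` and `path_2_2_4` are hypothetical names, the 2/38 split takes any 40-character hex digest, and the 2/2/4 variant simply truncates a blob id to its first 8 hex characters.

```python
def path_2_38(hex_id: str) -> str:
    """v1-style layout from the thread: 2-char directory, 38-char file."""
    if len(hex_id) != 40:
        raise ValueError("expected a 40-character hex digest")
    return f"{hex_id[:2]}/{hex_id[2:]}"


def path_2_2_4(blob_hex: str) -> str:
    """Hypothetical 2/2/4 layout: first 8 hex chars of the blob id, split."""
    if len(blob_hex) < 8:
        raise ValueError("need at least 8 hex characters")
    return f"{blob_hex[:2]}/{blob_hex[2:4]}/{blob_hex[4:8]}"
```

Eight hex characters carry only 32 bits, so two different blobs can map to the same path, which is why some collision handling (via git history, as suggested above) would still be needed.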
Multiple emails per-commit doesn't make sense for public
archives. For personal archives, you could probably snap off
1-file-per-commit history periodically to make a big tree
to reduce commit objects. The cost of losing compatibility,
rewriting history + repacking, to save 100M there out of 1G(?)
or so doesn't seem like a great trade-off, though.
I wonder how much can be saved with short author/committer info
and empty commit messages, even. I'd rather do that than break
history and require repacking.
If I wanted to track replied/seen/etc... state in git for
personal mail, I'd probably use 'r', 's', etc filenames; but I'm
not sure it'd be in the same or different git repo from the
public one.
That said; I don't know if I want to store state in git or
SQLite or something else...
Looking forward to making Xapian and position data optional :>

* Re: Q: V2 format
  2018-07-12 23:09 ` Eric Wong
@ 2018-07-13 13:39 ` ebiederm
  2018-07-13 20:03 ` ebiederm
  ` (2 more replies)
  0 siblings, 3 replies; 21+ messages in thread
From: ebiederm @ 2018-07-13 13:39 UTC (permalink / raw)
To: Eric Wong; +Cc: meta
[-- Attachment #1: Type: text/plain, Size: 10900 bytes --]
Eric Wong <e@80x24.org> writes:
> "Eric W. Biederman" <ebiederm@xmission.com> wrote:
>> Eric Wong <e@80x24.org> writes:
>> > "Eric W. Biederman" <ebiederm@xmission.com> wrote:
>> >> I have been digging through the code looking so I can understand the v2
>> >> format and I have some ideas on how things might be improved, and some
>> >> questions so that I understand.
>> >
>> > Great to know you're interested! Fwiw, I've still been meaning
>> > to turn my v2 docs into a POD manpage:
>> >
>> > https://public-inbox.org/meta/20180419015813.GA20051@dcvr/
>>
>> I have some personal mail archives that I need to do something better
>> with. My goal is for day-to-day operations (aka mail delivery and
>> archiving) to be able to run on a smallish 32bit machine.
>
> Great to hear your interest in that! public-inbox.org is still
> 32-bit on a $20/month VPS. Xapian really does better with an
> SSD (freshly TRIM-ed), though; so my low-end netbook with HDD
> struggles on big inboxes at the moment.
I am leery of SSDs at the moment. It was probably bad luck but my last
mail setup using cyrus (1 message per file) managed to kill an SSD in
under a year.
>> But archives are not valuable unless you have a fast search capability
>> which makes all of the features of xapian very interesting.
>
> Agreed.
>
>> I need to compare message id's to see if I have content missing from the
>> public linux-kernel archive. It is probably Konrad's cleanup of the
>> headers but my linux-kernel archive when imported into public-inbox is
>> slightly larger than Konrad's.
>
> Konrad == Konstantin?
Yes. Konstantin Ryabitsev.
Konstantin, my apologies I did not mean to scramble your name.
> I haven't looked at what's in lore, yet,
> but there were numerous header differences from the archives he
> gave me for v2 development vs what I got from my own archives.
>
> Off the top of my head:
>
> * addresses in To:/Cc: lists rewritten for some old list addresses
>
> * some addressee formatting/quoting changes as a result
>
> * last (most recent) Received: header removed (but not actually
> enough to anonymize the original recipient in most cases).
> This affects sorting comparisons in search results
>
> * reencoded some MIME parts to different encodings (to 8bit, I think)
>
> Maybe some others.
>
>> I also like the idea of being able to read and archive public lists that
>> I care about with just a git fetch and local tools.
>
> Yes. I still use "git log -p -B" etc. That said; I don't want
> to give up too much to support that (the SQLite dependency doesn't
> seem too expensive); and try to keep public-inbox easy-to-install.
> Making Xapian optional will be a huge part of that.
What I meant is that it is very useful not to need to sync
anything other than the git repository between machines.
>> Public mailing lists and their archives are more important, but on my
>> radar is also IMAP/regular email support, with its little bit of extra
>> state.
>
> Cool. I've been thinking about something for personal mail,
> too. mairix is killing my beefier personal machine (because it
> needs to rewrite the entire index every time) and
> Maildirs+notmuch is a non-starter due to dentry cache overheads
> and inode consumption.
>
>> >> What is the thinking about deleted entries, and for v2 what is the
>> >> preferred way to delete mail from a public inbox git repository and why?
>> >
>> > Definitely prefer the normal way with 'd' files to not break
>> > people using non-force fetches. "Purge" is too disruptive
>> > and reserved for extraordinary cases (e.g. legal reasons).
>>
>> Then I am going to report a probable bug. In V2 in public-inbox-index
>> I can not find a path from finding a 'd' file and a call to unindex. V1
>> unindexes deleted files. Rebased heads for purges call unindex. I
>> don't see that for ordinary d files though.
>
> It shouldn't need to call unindex because they never get indexed
> on rebuilds. V2 indexing walks history backwards (normal "git log"
> behavior) so it remembers 'd' paths in the "$D" hash; and skips blobs
> as it encounters them.
>
> v1 needed to unindex because it used "git log --reverse" to walk
> forward in history.
This assumes that you see them in the same git pull. I would think
ideally anything that is going to be deleted that quickly you can just
skip archiving.
What time window do you expect 'd' messages to appear in?
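
To make the concern concrete, here is a minimal sketch of the backward walk described above. It is an illustration under assumptions, not the public-inbox-index code: the `@history` list and the `walk_backwards` helper are made up for the example, and each commit is reduced to an operation ('m' or 'd') plus a blob id.

```perl
use strict;
use warnings;

# Hypothetical history, oldest first. Each commit either adds a message
# blob at path 'm' or records its removal at path 'd'.
my @history = (
    [ 'm', 'blob-a' ],
    [ 'm', 'blob-b' ],
    [ 'd', 'blob-a' ],   # blob-a is deleted after it was added
    [ 'm', 'blob-c' ],
);

# Walk newest-to-oldest. A 'd' is seen before the 'm' it cancels, so it is
# remembered in %D and the matching add is skipped when it turns up.
sub walk_backwards {
    my (@commits) = @_;
    my (%D, @indexed);
    foreach my $c (reverse @commits) {
        my ($op, $blob) = @$c;
        if ($op eq 'd') {
            $D{$blob} = 1;
        } elsif (delete $D{$blob}) {
            next;                   # cancelled by a later delete
        } else {
            push @indexed, $blob;
        }
    }
    return (\@indexed, [ sort keys %D ]);
}

# Full rebuild: the delete and its add are both in range.
my ($full, $full_pending) = walk_backwards(@history);
print "full: @$full (unmatched deletes: @$full_pending)\n";

# Incremental run over only the last two commits: the 'd' for blob-a has no
# matching 'm' in range, so the earlier index entry is never touched here.
my ($incr, $incr_pending) = walk_backwards(@history[2, 3]);
print "incremental: @$incr (unmatched deletes: @$incr_pending)\n";
```

In this toy model a full rebuild never indexes blob-a, while the incremental run is left with an unmatched delete, which is the case the question above is about.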
>> >> Size. Reading the history of the public inbox meta mailling list and
>> >> playing around I discovered that I can shave off about 100M of the V2
>> >> size of the git public inbox git repository but pushing all of the
>> >> messages into a single commit. Not great for day to day operation,
>> >> but if rebasses are part of the plan, and old archives part of the
>> >> challenge I see quite a lot of potential for old archives to be reduced
>> >> to a git repository with a single commit.
>> >
>> > Rebases/rewriting history is definitely not part of the plan and
>> > a last resort.
>> >
>> >> Names. Is there a good reason not to use message numbers as the names
>> >> in the git repositories? (Other than the cost to change the code?) That
>> >> would remove the need for treat the sqlite msgmap database as precious,
>> >> and it would make it easier to recover if an nntp server goes away. In
>> >> V2 format the git mailing list git repository is only about 2M larger if
>> >> each message has it's msg number as it's name. Plus the git log
>> >> is easier to read as messages are all + or -.
>> >
>> > Big trees in git were a scalability problem in v1 because of the
>> > long 2/38 names. With shorter names you propose (base-10 serial
>> > number?, the scalability problem gets pushed off a bit, I suppose.
>> > But not indefinitely; and later v2 partitions will suffer more
>> > from longer names.
>>
>> Bit trees were a scalability problem in git becuase they are quadratic.
>> Every commit mentioned every email. So a walk of the history would
>> have to visit every file on every commit. I expect those tree objects
>> in the history compress well with their parents but it doesn't simplify
>> the tree walker.
>>
>> Would you like my test conversion script from V1 so you can take a look?
>
> Sure, but I can't guarantee I can find the time to spend on it;
> but others might be interested.
>
>> > The current v2 is also better for inode-starved users in case
>> > somebody forgets to type "--mirror" or "--bare" with clone. For
>> > the most part (unless purge is used), the SQLite database is
>> > actually recoverable.
>>
>> Because of the parallelism in V2 I have noticed messages in numbered
>> in an order that does not correspond to their commit order. So the
>> SQLite database isn't as recoverable as it might be. Especially as the
>> parallelism introduces an element of non-determinancy.
>
> *puzzled* were you able to reproduce that? The serial number
> generation + threading happens in the main process and the
> parallelism is limited to Xapian text indexing. -index
> generates serial numbers by walking backwards with v2, and
> complains on unexpected results.
I will have to look a bit deeper. It was just something I noticed in
passing as I was rewriting mailboxes with msgnum extracted from
SQLite. I will see if I can track that one down. I very much
value retaining enough information in the git archive to reconstruct
the serial numbers, so that all that needs to be backed up is the
git archive. Even purge can insert a dummy entry so I don't think
there is any time when we would not be able to preserve them with the
current setup.
> As far as personal mail goes, I wouldn't want serial numbers at all
> (more unnecessary state to keep track of).
At least imap requires serial numbers, and I imagine the easy transition
for mail clients is to have an imap server. As you have mentioned an
ordered list of commits is good enough to reconstruct the msgnum
reliably so it is unlikely we would need to do anything special there.
>> > So no, I don't think having serial numbers stored in filenames
>> > is the right thing.
>>
>> I won't push it but I at the present time I respectfully disagree.
>>
>> The big advantage I see with serial numbers (other than msgmap) is that
>> you can include multiple emails per commit (without going quadratic). I
>> am also looking at potentially storing the other email states that IMAP
>> and maildir mailboxes track. I can imagine that much more easily with
>> message numbers. Still I want to avoid something that makes git go
>> quadratic again.
>
> You'd want deeper trees; still. I'd still use hex, and maybe
> truncate the blob hash to avoid having to keep track of any
> serial number state. Maybe 2/2/4 naming is enough while using
> git history to resolve collisions.
The key fundamental difference is whether you keep the same files from
one commit to another. To demonstrate this I have attached the quick
conversion script I used to test this. It uses h{40} names. Totally
flat. "time git rev-list --objects --all | wc -l" on the git mailing list
archive takes just over 5 seconds.
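
The naming schemes under discussion can be sketched as follows. This is only an illustration: `flat_name` mirrors the 2/38 to h{40} rewrite done by the attached script, while `short_name` is one possible reading of the "2/2/4" idea quoted above, and both helper names and the sample ids are made up for the example.

```perl
use strict;
use warnings;

my $h = '[0-9a-f]';

# v1 stores each message at "xx/<38 hex digits>"; joining the two parts
# gives one flat 40-character name, as in the attached test script.
sub flat_name {
    my ($v1_path) = @_;
    $v1_path =~ m{\A(${h}{2})/(${h}{38})\z} or return;
    return $1 . $2;
}

# Hypothetical 2/2/4 layout built from the first 8 hex digits of a blob id.
sub short_name {
    my ($oid) = @_;
    $oid =~ m{\A(${h}{2})(${h}{2})(${h}{4})} or return;
    return "$1/$2/$3";
}

my $v1_path = 'ab/' . ('cd' x 19);   # made-up v1 path
my $oid     = 'ef' x 20;             # made-up blob id

my $flat  = flat_name($v1_path);
my $short = short_name($oid);
print "$flat\n";
print "$short\n";
```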
Compared to your one file name case:
$ du -hs git/git/0.git/ git-long-names/git/0.git/
759M git/git/0.git/
772M git-long-names/git/0.git/
So the only difference is that using shorter filenames saves 13M.
The original git tree in V1 format is 1001M so still 30M larger.
And "time git rev-list --objects --all | wc -l" takes 1m14s.
Making it definitely slower.
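
A rough model of why the two layouts walk so differently, assuming a flat tree and ignoring any sharing of unchanged subtrees that git can exploit. The function names are invented for this sketch and the counts are not measurements of any real repository.

```perl
use strict;
use warnings;

# Files are kept from commit to commit, so the tree of commit $i lists
# $i messages and a naive walk of every commit sees 1 + 2 + ... + $n entries.
sub entries_keep_all {
    my ($n) = @_;
    my $total = 0;
    $total += $_ for 1 .. $n;
    return $total;
}

# Each commit's tree holds only the message it adds: $n entries in total.
sub entries_one_per_commit {
    my ($n) = @_;
    return $n;
}

for my $n (1_000, 100_000) {
    printf "%7d messages: %d vs %d tree entries\n",
        $n, entries_keep_all($n), entries_one_per_commit($n);
}
```

The first count grows with the square of the number of messages and the second only linearly, which is the shape of the difference the timings above point at.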
> Multiple emails per-commit doesn't make sense for public
> archives.
I am not certain. For a mailing list like linux-kernel, especially when
someone sends a patch series to the list and it arrives all at once, I
imagine there is potential there. I believe this is visible in the mail
delivery pipeline if you implement LMTP.
> For personal archives, you could probably snap off
> 1-file-per-commit history periodically to make make a big tree
> to reduce commit objects. The cost of losing compatibility,
> rewriting history + repacking, to save 100M there out of 1G(?)
> or so doesn't seem like a great trade-off, though.
It is significant. Mostly it seems to make sense for importing archives
or really compacting archives for storage.
> I wonder how much can be saved with short author/committer info
> and empty commit messages, even. I'd rather do that than break
> history and require repacking.
You seem to have saved 13M with one character file names.
> If I wanted to track replied/seen/etc... state in git for
> personal mail, I'd probably use 'r', 's', etc filenames; but I'm
> not sure it'd be in the same or different git repo from the
> public one.
>
> That said; I don't know if I want to store state in git or
> SQLite or something else...
Agreed. That all bears some careful looking into.
> Looking forward to making Xapian and position data optional :>
[-- Attachment #2: public-inbox-convert-long-names --]
[-- Type: text/plain, Size: 4277 bytes --]
#!/usr/bin/perl -w
# Copyright (C) 2018 all contributors <meta@public-inbox.org>
# License: AGPL-3.0+ <http://www.gnu.org/licenses/agpl-3.0.txt>
use strict;
use warnings;
use Getopt::Long qw(:config gnu_getopt no_ignore_case auto_abbrev);
use PublicInbox::MIME;
use PublicInbox::InboxWritable;
use PublicInbox::Config;
use PublicInbox::V2Writable;
use PublicInbox::Import;
use PublicInbox::Spawn qw(spawn);
use Cwd 'abs_path';
use File::Copy 'cp'; # preserves permissions:
my $usage = "Usage: public-inbox-convert OLD NEW\n";
my $jobs;
my $index = 1;
my %opts = (
'--jobs|j=i' => \$jobs,
'--index!' => \$index,
);
GetOptions(%opts) or die "bad command-line args\n$usage";
my $old_dir = shift or die $usage;
my $new_dir = shift or die $usage;
die "$new_dir exists\n" if -d $new_dir;
die "$old_dir not a directory\n" unless -d $old_dir;
my $config = eval { PublicInbox::Config->new };
$old_dir = abs_path($old_dir);
my $old;
if ($config) {
$config->each_inbox(sub {
$old = $_[0] if abs_path($_[0]->{mainrepo}) eq $old_dir;
});
}
unless ($old) {
warn "W: $old_dir not configured in " .
PublicInbox::Config::default_file() . "\n";
$old = {
mainrepo => $old_dir,
name => 'ignored',
address => [ 'old@example.com' ],
};
$old = PublicInbox::Inbox->new($old);
}
$old = PublicInbox::InboxWritable->new($old);
if (($old->{version} || 1) >= 2) {
die "Only conversion from v1 inboxes is supported\n";
}
my $new = { %$old };
$new->{mainrepo} = abs_path($new_dir);
$new->{version} = 2;
$new = PublicInbox::InboxWritable->new($new);
my $v2w;
$old->umask_prepare;
sub link_or_copy ($$) {
my ($src, $dst) = @_;
link($src, $dst) and return;
$!{EXDEV} or warn "link $src, $dst failed: $!, trying cp\n";
cp($src, $dst) or die "cp $src, $dst failed: $!\n";
}
$old->with_umask(sub {
my $old_cfg = "$old->{mainrepo}/config";
local $ENV{GIT_CONFIG} = $old_cfg;
my $new_cfg = "$new->{mainrepo}/all.git/config";
$v2w = PublicInbox::V2Writable->new($new, 1);
$v2w->init_inbox($jobs);
unlink $new_cfg;
link_or_copy($old_cfg, $new_cfg);
if (my $alt = $new->{altid}) {
require PublicInbox::AltId;
foreach my $i (0..$#$alt) {
my $src = PublicInbox::AltId->new($old, $alt->[$i], 0);
$src->mm_alt or next;
my $dst = PublicInbox::AltId->new($new, $alt->[$i], 1);
$dst = $dst->{filename};
$src->mm_alt->{dbh}->sqlite_backup_to_file($dst);
}
}
my $desc = "$old->{mainrepo}/description";
link_or_copy($desc, "$new->{mainrepo}/description") if -e $desc;
my $clone = "$old->{mainrepo}/cloneurl";
if (-e $clone) {
warn <<"";
$clone may not be valid after migrating to v2, not copying
}
});
my $state = '';
my ($prev, $from);
my $head = $old->{ref_head} || 'HEAD';
my ($rd, $pid) = $old->git->popen(qw(fast-export --use-done-feature), $head);
$v2w->idx_init;
my $im = $v2w->importer;
my ($r, $w) = $im->gfi_start;
my $h = '[0-9a-f]';
my %D;
my $purged = 0;
while (<$rd>) {
if ($_ eq "blob\n") {
$state = 'blob';
} elsif (/^commit /) {
$state = 'commit';
$purged = 0;
} elsif (/^data (\d+)/) {
my $len = $1;
$w->print($_) or $im->wfail;
while ($len) {
my $n = read($rd, my $tmp, $len) or die "read: $!";
warn "$n != $len\n" if $n != $len;
$len -= $n;
$w->print($tmp) or $im->wfail;
}
next;
} elsif ($state eq 'commit') {
if (m/^([MDcRN] | deleteall)/) {
if (!$purged) {
$purged = 1;
$w->print("deleteall\n") or $im->wfail;
}
}
if (m{^M 100644 :(\d+) (${h}{2})/(${h}{38})}o) {
my ($mark, $path) = ($1, $2 . $3);
${D}{$path} = $mark;
$w->print("M 100644 :$mark $path\n") or $im->wfail;
next;
}
if (m{^D (${h}{2})/(${h}{38})}o) {
my $path = $1 . $2;
my $mark = delete $D{$path};
defined $mark or die "undeleted path: $1\n";
$w->print("M 100644 :$mark d\n") or $im->wfail;
next;
}
if (m{^from (:\d+)}) {
$prev = $from;
$from = $1;
# no next
}
}
last if $_ eq "done\n";
$w->print($_) or $im->wfail;
}
$w = $r = undef;
close $rd or die "close fast-export: $!\n";
waitpid($pid, 0) or die "waitpid failed: $!\n";
$? == 0 or die "fast-export failed: $?\n";
my $mm = $old->mm;
$mm->{dbh}->sqlite_backup_to_file("$new_dir/msgmap.sqlite3") if $mm;
$v2w->done;
if ($index) {
$v2w->index_sync;
$v2w->done;
}

* Re: Q: V2 format
2018-07-13 13:39 ` ebiederm
@ 2018-07-13 20:03 ` ebiederm
2018-07-13 22:22 ` msgmap serial number regeneration [was: Q: V2 format] Eric Wong
2018-07-13 22:02 ` bug: v2 deletes on incremental fetch " Eric Wong
2018-07-13 23:07 ` IMAP server [was: Q: V2 format] Eric Wong
2 siblings, 1 reply; 21+ messages in thread
From: ebiederm @ 2018-07-13 20:03 UTC (permalink / raw)
To: Eric Wong; +Cc: meta
ebiederm@xmission.com (Eric W. Biederman) writes:
> Eric Wong <e@80x24.org> writes:
>
>> "Eric W. Biederman" <ebiederm@xmission.com> wrote:
>>>
>>> Because of the parallelism in V2 I have noticed messages in numbered
>>> in an order that does not correspond to their commit order. So the
>>> SQLite database isn't as recoverable as it might be. Especially as the
>>> parallelism introduces an element of non-determinancy.
>>
>> *puzzled* were you able to reproduce that? The serial number
>> generation + threading happens in the main process and the
>> parallelism is limited to Xapian text indexing. -index
>> generates serial numbers by walking backwards with v2, and
>> complains on unexpected results.
Digging into this I have found consistently non-reproducible numbering,
because of deleted files. Apparently in both V1 and V2 a worst-case
estimate is made of the total numbers that are going to be needed and
numbers are assigned backwards from there.
A fresh indexing of the git mailing list archive on v1 gives me numbers
starting with 360 and on v2 numbers starting with 355, which
corresponds with the number of deleted messages.
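
A simplified model of that effect, assuming the estimate is the total number of adds in history and that messages which were later deleted consume no number. `lowest_serial` and the figures passed to it are invented for this sketch and are not taken from the indexing code.

```perl
use strict;
use warnings;

# $adds messages were ever added and $deleted of them were later removed.
# Numbers are handed out from the top while walking newest-to-oldest, one
# per surviving message, so the oldest survivor gets the lowest number.
sub lowest_serial {
    my ($adds, $deleted) = @_;
    my $next = $adds;               # worst case: every add survives
    my $lowest;
    for (1 .. $adds - $deleted) {   # one iteration per surviving message
        $lowest = $next--;
    }
    return $lowest;
}

print lowest_serial(10, 0), "\n";   # 1
print lowest_serial(10, 3), "\n";   # 4
```

In this model the first number in use is always one more than the number of deleted messages, so a rebuild after further deletes would not reproduce the earlier numbering.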
I am still looking to see if there are any other weird things here.
I definitely do not like not being able to reconstruct message numbers
from a backup.
Eric

* msgmap serial number regeneration [was: Q: V2 format]
2018-07-13 20:03 ` ebiederm
@ 2018-07-13 22:22 ` Eric Wong
2018-07-14 19:01 ` ebiederm
0 siblings, 1 reply; 21+ messages in thread
From: Eric Wong @ 2018-07-13 22:22 UTC (permalink / raw)
To: Eric W. Biederman; +Cc: meta
"Eric W. Biederman" <ebiederm@xmission.com> wrote:
> ebiederm@xmission.com (Eric W. Biederman) writes:
> > Eric Wong <e@80x24.org> writes:
> >> "Eric W. Biederman" <ebiederm@xmission.com> wrote:
> >>>
> >>> Because of the parallelism in V2 I have noticed messages in numbered
> >>> in an order that does not correspond to their commit order. So the
> >>> SQLite database isn't as recoverable as it might be. Especially as the
> >>> parallelism introduces an element of non-determinancy.
> >>
> >> *puzzled* were you able to reproduce that? The serial number
> >> generation + threading happens in the main process and the
> >> parallelism is limited to Xapian text indexing. -index
> >> generates serial numbers by walking backwards with v2, and
> >> complains on unexpected results.
>
> Digging into this I have found consistenly non-reproducible numbering,
> because of deleted files. Apparently in both V1 and V2 an a worst-case
> estimate is made of the total numbers that are going to be needed and
> numbers are assigned backwards from there.
>
> A fresh indexing of the git mailling list archive on v1 gives me numbers
> starting with 360 and on v2 numbers starting with 355. Which
> corresponds with the number of deleted messages.
>
> I am still looking to see if there are any other weird things here.
Ah, yes, you're correct: deletes don't get accounted for when
regenerating. Oh well. I guess it was correct to document msgmap
as something important to back up and not break for instances of
particular servers. (emphasis on "particular servers")
So I think you'd need to walk revision history twice to account
for deleted messages...
Across different machines, it should not matter to preserve
serials.
> I definitely do not like not being able to reconstruct message numbers
> from a backup.
For v2, I see serial numbers are an internal optimization which
happens to map to NNTP.
If the git repo is cloned and the cloner sets up a different
server, it'll have a different address and clients won't know to
deduplicate them anyways. I suppose it makes the load-balanced
case a little more complex to sync(*)
And this can't even account for independently started mirrors
with no common git ancestry, as SMTP has zero guarantees on
ordering.
(*) But optimizing for load-balanced instances isn't ideal,
I'd rather see more independently-run servers than giant
load-balanced instances which everybody relies on.

* IMAP server [was: Q: V2 format]
2018-07-13 13:39 ` ebiederm
2018-07-13 20:03 ` ebiederm
2018-07-13 22:02 ` bug: v2 deletes on incremental fetch " Eric Wong
@ 2018-07-13 23:07 ` Eric Wong
2018-07-13 23:12 ` ebiederm
2018-09-28 20:10 ` Johannes Berg
2 siblings, 2 replies; 21+ messages in thread
From: Eric Wong @ 2018-07-13 23:07 UTC (permalink / raw)
To: Eric W. Biederman; +Cc: meta
"Eric W. Biederman" <ebiederm@xmission.com> wrote:
> > "Eric W. Biederman" <ebiederm@xmission.com> wrote:
> >> Eric Wong <e@80x24.org> writes:
> > As far as personal mail goes, I wouldn't want serial numbers at all
> > (more unnecessary state to keep track of).
>
> At least imap requires serial numbers, and I imagine the easy transition
> for mail clients is to have an imap server. As you have mentioned an
> ordered list of commits is good enough to reconstruct the msgnum
> reliably so it is unlikely we would need to do anything special there.
I would rather layer IMAP (and POP3) on top of NNTP than to tie
it to any git/SQLite/Xapian parts in public-inbox. We could
ship it with public-inbox, of course; but I don't see why an
IMAP or POP3 server could not work by using innd (or similar) as
a backend.
I don't think any design compromises need to be made to the existing
git/SQLite/Xapian parts to support IMAP/POP3.
Hosting an IMAP/POP3 server is way more overhead for the admin
as it requires storing user credentials and storing per-reader
state. So the preference is to do NNTP as well as possible and
layer the complexity of per-user account data on top of it.
Right now, none of the NNTP/HTTP parts require write access
to the machine it runs on aside from log files.
Thus the goal is to promote NNTP usage as it's cheapest/easiest
for the server admin; but to still have IMAP/POP3 as stopgaps
(similar to the ssoma/mlmmj-replay script I use to allow SMTP
subscriptions to this inbox).

* Re: IMAP server [was: Q: V2 format]
2018-07-13 23:07 ` IMAP server [was: Q: V2 format] Eric Wong
@ 2018-07-13 23:12 ` ebiederm
2018-09-28 20:10 ` Johannes Berg
1 sibling, 0 replies; 21+ messages in thread
From: ebiederm @ 2018-07-13 23:12 UTC (permalink / raw)
To: Eric Wong; +Cc: meta
Eric Wong <e@80x24.org> writes:
> "Eric W. Biederman" <ebiederm@xmission.com> wrote:
>> > "Eric W. Biederman" <ebiederm@xmission.com> wrote:
>> >> Eric Wong <e@80x24.org> writes:
>> > As far as personal mail goes, I wouldn't want serial numbers at all
>> > (more unnecessary state to keep track of).
>>
>> At least imap requires serial numbers, and I imagine the easy transition
>> for mail clients is to have an imap server. As you have mentioned an
>> ordered list of commits is good enough to reconstruct the msgnum
>> reliably so it is unlikely we would need to do anything special there.
>
> I would rather layer IMAP (and POP3) on top of NNTP than to tie
> it to any git/SQLite/Xapian parts in public-inbox. We could
> ship it with public-inbox, of course; but I don't see why an
> IMAP or POP3 server could not work by using innd (or similar) as
> a backend.
>
> I don't think any design compromises need to be made to existing
> the git/SQLite/Xapian parts to support IMAP/POP3.
>
> Hosting an IMAP/POP3 server is way more overhead for the admin
> as it requires storing user credentials and storing per-reader
> state. So the preference is to do NNTP as well as possible and
> layer the complexity of per-user account data on top of it.
>
> Right now, none of the NNTP/HTTP parts require write access
> to the machine it runs on aside from log files.
>
> Thus the goal is to promote NNTP usage as it's cheapest/easiest
> for the server admin; but to still have IMAP/POP3 as stopgaps
> (similar to the ssoma/mlmmj-replay script I use to allow SMTP
> subscriptions to this inbox).
That makes complete sense. I definitely agree that NNTP should be what
is optimized for.
Eric

* Re: msgmap serial number regeneration [was: Q: V2 format]
2018-07-13 22:22 ` msgmap serial number regeneration [was: Q: V2 format] Eric Wong
@ 2018-07-14 19:01 ` ebiederm
2018-07-15 3:18 ` Eric Wong
0 siblings, 1 reply; 21+ messages in thread
From: ebiederm @ 2018-07-14 19:01 UTC (permalink / raw)
To: Eric Wong; +Cc: meta
Eric Wong <e@80x24.org> writes:
> "Eric W. Biederman" <ebiederm@xmission.com> wrote:
>> ebiederm@xmission.com (Eric W. Biederman) writes:
>> > Eric Wong <e@80x24.org> writes:
>> >> "Eric W. Biederman" <ebiederm@xmission.com> wrote:
>> >>>
>> >>> Because of the parallelism in V2 I have noticed messages in numbered
>> >>> in an order that does not correspond to their commit order. So the
>> >>> SQLite database isn't as recoverable as it might be. Especially as the
>> >>> parallelism introduces an element of non-determinancy.
>> >>
>> >> *puzzled* were you able to reproduce that? The serial number
>> >> generation + threading happens in the main process and the
>> >> parallelism is limited to Xapian text indexing. -index
>> >> generates serial numbers by walking backwards with v2, and
>> >> complains on unexpected results.
>>
>> Digging into this I have found consistenly non-reproducible numbering,
>> because of deleted files. Apparently in both V1 and V2 an a worst-case
>> estimate is made of the total numbers that are going to be needed and
>> numbers are assigned backwards from there.
>>
>> A fresh indexing of the git mailling list archive on v1 gives me numbers
>> starting with 360 and on v2 numbers starting with 355. Which
>> corresponds with the number of deleted messages.
>>
>> I am still looking to see if there are any other weird things here.
>
> Ah, yes, you're correct deletes don't get accounted for when
> regenerating. Oh well. I guess it was correct to document msgmap
> as something important to backup and not break for instances of
> particular servers. (emphasis on "particular servers")
>
> So I think you'd need to walk revision history twice to account
> for deleted messages...
>
> Across different machines, it should not matter to preserve
> serials.
I believe we can modify the msg number assignment to assign numbers to
deletes as well as adds. Short of the same Message-ID coming up twice
that should be enough for the current backwards loop to assign message
ids reliably. And even Message-IDs coming up twice can be handled.
>> I definitely do not like not being able to reconstruct message numbers
>> from a backup.
>
> For v2, I see serial numbers are an internal optimization which
> happens to map to NNTP.
>
> If the git repo is cloned and the cloner sets up a different
> server, it'll have a different address and clients won't know to
> deduplicate them anyways. I suppose it makes the load-balanced
> case a little more complex to sync(*)
But what if the server hardware fails? In the case I am dealing with at
the moment, I can stand up a new server with the same IP address.
Further if we can make everything but the git repository non-essential
it yields more flexibility for changing and optimizing things in the
future.
> (*) But optimizing for load-balanced instances isn't ideal,
> I'd rather see more independently-run servers than giant
> load-balanced instances which everybody relies on.
True.
At this point I am just optimizing for the operational simplicity
of my own independently-run server.
Eric

* Re: msgmap serial number regeneration [was: Q: V2 format]
2018-07-14 19:01 ` ebiederm
@ 2018-07-15 3:18 ` Eric Wong
2018-07-16 15:20 ` ebiederm
0 siblings, 1 reply; 21+ messages in thread
From: Eric Wong @ 2018-07-15 3:18 UTC (permalink / raw)
To: Eric W. Biederman; +Cc: meta
"Eric W. Biederman" <ebiederm@xmission.com> wrote:
> I believe we can modify the msg number assignment to assign numbers to
> deletes as well as adds. Short of the same Message-ID coming up twice
> that should be enough for the current backwards loop to assign message
> ids reliably. And even Message-IDs comming up twice is handle-able.
OK, I would likely accept a patch to fix that.
A note about Message-ID uniqueness... The v2 code will generate
a new, truly unique Message-ID on duplicates and use that in
msgmap instead of what was in the message. It's gross, but I needed
to do that to allow all messages to be accessible via Message-ID
over NNTP, because:
a) some legit messages reuse Message-IDs :<
b) some broken mailers (including some versions of git-send-email)
put multiple Message-IDs in the same message, so the code
needs to handle messages with any number of Message-IDs
anyways.
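
A minimal sketch of that kind of fallback, under stated assumptions. It is not the v2 implementation: the `mid_for` helper, the content-hash scheme, and the "@localhost" suffix are all invented here to show one way every message can end up with its own key even when Message-IDs repeat or a message carries several.

```perl
use strict;
use warnings;
use Digest::SHA qw(sha1_hex);

my %seen;   # Message-IDs already claimed by an earlier message

# Return the key to store for a message: the first of its Message-IDs
# that is still free, otherwise a generated one.
sub mid_for {
    my ($raw, @mids) = @_;
    foreach my $mid (@mids) {
        next if $seen{$mid}++;      # already taken by another message
        return $mid;
    }
    # Every listed Message-ID is taken (or there was none): derive a new
    # one from the content, retrying with a counter until it is unused.
    my $n = 0;
    my $gen;
    do {
        $gen = sha1_hex($raw, "\0", $n++) . '@localhost';
    } while ($seen{$gen}++);
    return $gen;
}

my $first  = mid_for("body one",   'a@example.com');
my $second = mid_for("body two",   'a@example.com');
my $third  = mid_for("body three", 'a@example.com', 'b@example.com');
print "$first\n$second\n$third\n";
```

Here the first message keeps its Message-ID, the second gets a generated one, and the third falls through to its second Message-ID.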

* Re: msgmap serial number regeneration [was: Q: V2 format]
2018-07-15 3:18 ` Eric Wong
@ 2018-07-16 15:20 ` ebiederm
0 siblings, 0 replies; 21+ messages in thread
From: ebiederm @ 2018-07-16 15:20 UTC (permalink / raw)
To: Eric Wong; +Cc: meta
Eric Wong <e@80x24.org> writes:
> "Eric W. Biederman" <ebiederm@xmission.com> wrote:
>> I believe we can modify the msg number assignment to assign numbers to
>> deletes as well as adds. Short of the same Message-ID coming up twice
>> that should be enough for the current backwards loop to assign message
>> ids reliably. And even Message-IDs comming up twice is handle-able.
>
> OK, I would likely accept a patch to fix that.
>
> A note about Message-ID uniqueness... The v2 code will generate
> a new, truly unique Message-ID on duplicates and use that in
> msgmap instead what was in the message. It's gross, but I needed
> to do that to allow all messages to be accessible via Message-ID
> over NNTP, because:
>
> a) some legit messages reuse Message-IDs :<
>
> b) some broken mailers (including some versions of git-send-email)
> put multiple Message-IDs in the same message, so the code
> needs to handle messages with any number of Message-IDs
> anyways.
I will send the patch along shortly.
I misspoke when I said the problem could be fixed by assigning numbers to
deletes. The actual problem was that not every add was assigned a
number. So the fix is simpler than I expected.
It is interesting to note that
INSERT
DEL
INSERT
in SQLite does not reuse numbers in the primary key. So not
reassigning numbers is what the local SQLite database does as well.
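
Whether SQLite hands a number out again depends on how the key column is declared, which the following sketch shows. It assumes DBI and DBD::SQLite are installed, and the table and column names are chosen for illustration rather than copied from the msgmap schema.

```perl
use strict;
use warnings;
use DBI;

# Insert a row, delete it, insert another, and report the second row's key.
sub second_key {
    my ($key_decl) = @_;
    my $dbh = DBI->connect('dbi:SQLite:dbname=:memory:', '', '',
                           { RaiseError => 1 });
    $dbh->do("CREATE TABLE msgmap (num $key_decl, mid TEXT)");
    $dbh->do('INSERT INTO msgmap (mid) VALUES (?)', undef, 'a@example.com');
    $dbh->do('DELETE FROM msgmap WHERE mid = ?',    undef, 'a@example.com');
    $dbh->do('INSERT INTO msgmap (mid) VALUES (?)', undef, 'b@example.com');
    my ($num) = $dbh->selectrow_array(
        'SELECT num FROM msgmap WHERE mid = ?', undef, 'b@example.com');
    $dbh->disconnect;
    return $num;
}

my $plain = second_key('INTEGER PRIMARY KEY');
my $auto  = second_key('INTEGER PRIMARY KEY AUTOINCREMENT');
print "plain: $plain\n";          # 1, the freed number is reused
print "autoincrement: $auto\n";   # 2, the freed number stays retired
```

With AUTOINCREMENT the sequence only moves forward, which matches the INSERT/DEL/INSERT behaviour described above.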
I need to track down what the v1 bug with add-remove-add was. I think
the way I have updated the code I won't need the bug fix for v1. But I
haven't checked that scenario yet.
I also need to write a test case, sigh.
But in practice I have this working for the git mailing list archive.
Eric

* Re: IMAP server [was: Q: V2 format]
2018-07-13 23:07 ` IMAP server [was: Q: V2 format] Eric Wong
2018-07-13 23:12 ` ebiederm
@ 2018-09-28 20:10 ` Johannes Berg
2018-09-28 21:01 ` ebiederm
1 sibling, 1 reply; 21+ messages in thread
From: Johannes Berg @ 2018-09-28 20:10 UTC (permalink / raw)
To: Eric Wong, Eric W. Biederman; +Cc: meta
Sorry to just jump into an old thread; I was wondering about IMAP server
support as well, in particular because, unlike NNTP, IMAP allows pushing
the search to the server, and that would be useful for local archives.
> Hosting an IMAP/POP3 server is way more overhead for the admin
> as it requires storing user credentials and storing per-reader
> state. So the preference is to do NNTP as well as possible and
> layer the complexity of per-user account data on top of it.
I'm not really sure that's true; dovecot, for example, provides their
lists archives via anonymous IMAP:
https://www.dovecot.org/mailinglists.html
They have instructions here on how to do that over dovecot:
https://wiki2.dovecot.org/HowTo/ReadOnlyArchive
In particular:
/var/home/anonymous/control# ls -la
drwxr-xr-x 3 root root 4096 May 25 15:43 ./
drwxr-xr-x 3 anondove root 4096 Mar 20 14:39 .imap/
-rw-r--r-- 1 root root 33 May 25 15:43 .subscriptions
Create the .subscriptions file manually to contain all the mailboxes
you. Note that the control directory isn't writable by anondove, so
that the subscriptions can't be changed.
[...]
* INBOX must always exists even if it's empty. Make sure it's not
writable.
* Make sure the mail directory isn't writable so users can't create new
mailboxes.
* The mboxes can be placed in the directory itself, or symlinks can be
used. Above you'll see that mailman places all Dovecot archives under
/var/home/archives. Make sure none of these files are writable by
anondove.
They also set up some read-only ACLs, I think to make the read-only
state clear to the user agent, but of course a public-inbox IMAP server
can hard-code all of this and not accept any write commands to start
with.
Anyway, just FYI; since I don't know perl at all I don't think I'll be
doing any work on this.
johannes

* Re: IMAP server [was: Q: V2 format]
2018-09-28 20:10 ` Johannes Berg
@ 2018-09-28 21:01 ` ebiederm
2018-10-01 7:46 ` Johannes Berg
0 siblings, 1 reply; 21+ messages in thread
From: ebiederm @ 2018-09-28 21:01 UTC (permalink / raw)
To: Johannes Berg; +Cc: Eric Wong, meta
Johannes Berg <johannes@sipsolutions.net> writes:
> Sorry to just jump into an old thread; I was wondering about IMAP server
> support as well, in particular because unlike NNTP that allows pushing
> the search to the server, and that would be useful for local archives.
>
>> Hosting an IMAP/POP3 server is way more overhead for the admin
>> as it requires storing user credentials and storing per-reader
>> state. So the preference is to do NNTP as well as possible and
>> layer the complexity of per-user account data on top of it.
>
> I'm not really sure that's true; dovecot, for example, provides their
> lists archives via anonymous IMAP:
> https://www.dovecot.org/mailinglists.html
>
> They have instructions here on how to do that over dovecot:
> https://wiki2.dovecot.org/HowTo/ReadOnlyArchive
>
> In particular:
>
> /var/home/anonymous/control# ls -la
> drwxr-xr-x 3 root root 4096 May 25 15:43 ./
> drwxr-xr-x 3 anondove root 4096 Mar 20 14:39 .imap/
> -rw-r--r-- 1 root root 33 May 25 15:43 .subscriptions
>
> Create the .subscriptions file manually to contain all the mailboxes
> you. Note that the control directory isn't writable by anondove, so
> that the subscriptions can't be changed.
>
> [...]
>
> * INBOX must always exists even if it's empty. Make sure it's not
> writable.
> * Make sure the mail directory isn't writable so users can't create new
> mailboxes.
> * The mboxes can be placed in the directory itself, or symlinks can be
> used. Above you'll see that mailman places all Dovecot archives under
> /var/home/archives. Make sure none of these files are writable by
> anondove.
>
> They also set up some read-only ACLs, I think to make the read-only
> state clear to the user agent, but of course a public-inbox IMAP server
> can hard-code all of this and not accept any write commands to start
> with.
>
> Anyway, just FYI; since I don't know perl at all I don't think I'll be
> doing any work on this.
I have looked at gnus and there is support in there for performing
searches via the old gmane web interface. Public inbox already provides
an attribute that tells you what the web server is. So all it will
really take is someone with a little time to wire up the search
interface.
Beyond that if you have the archives local (and that is easy) it is
quite possible to just git grep through them and find things of
interest.
I should verify this but I don't think IMAP has a good version of the
NNTP overview database. Which seems to make IMAP quite a bit slower for
reading news. Certainly gnus+public-inbox locally is running quite a
bit faster than my old gnus+cyrus-imap configuration.
I tried to read through the IMAP search specification to see how it
compares with what public-inbox makes available and I did not get
particularly far. It was not easy to match up the various search
capabilities. The biggest issue is that IMAP tends to not talk
about message-ids. Where that is fundamentally one of the most
important fields to index if you are dealing with threaded mail.
So long story short while I am not opposed to a read-only IMAP
configuration I think NNTP has much to recommend it. I do think we need
little things like SSL support for NNTP. Just to prevent inappropriate
access to traffic in flight.
It won't be for a while yet but I have some scripts I need to push at
least to the public-inbox scripts directory that simplify the process
of taking a single email address, subscribing to email, and sorting it
out into different public-inbox git archives. Currently I have every
mailing list I am subscribed to pushed into public-inbox.
Eric

* Re: IMAP server [was: Q: V2 format]
  2018-09-28 21:01 ` ebiederm
@ 2018-10-01  7:46 ` Johannes Berg
  2018-10-01  8:51 ` ebiederm
  0 siblings, 1 reply; 21+ messages in thread
From: Johannes Berg @ 2018-10-01 7:46 UTC (permalink / raw)
To: Eric W. Biederman; +Cc: Eric Wong, meta
On Fri, 2018-09-28 at 23:01 +0200, Eric W. Biederman wrote:
>
> I have looked at gnus and there is support in there for performing
> searches via the old gmane web interface. Public inbox already provides
> an attribute that tells you what the web server is. So all it will
> really take is someone with a little time to wire up the search
> interface.
That's ... interesting, but of course completely out-of-band. I'm not
sure it should or could be advocated that every email client actually
implement that :-)
But if you think broader than that, you don't even necessarily need a
web server to run p-i.
> Beyond that if you have the archives local (and that is easy) it is
> quite possible to just git grep through them and find things of
> interest.
That also doesn't use the index, not sure how that's any better?
> I should verify this but I don't think IMAP has a good version of the
> NNTP overview database. Which seems to make IMAP quite a bit slower for
> reading news. Certainly gnus+public-inbox locally is running quite a
> bit faster than my old gnus+cyrus-imap configuration.
IMAP servers typically should do header/MIME parsing, so you should be
able to query such a thing - but not as easily as XOVER, I suppose.
However, I think FETCH could be made to return the data similar to
XOVER, though it may not be backed by a pre-created database file, and
it depends on what the client does to show the overview in the first
place.
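As a rough illustration of what an overview record holds, here is a minimal
Python sketch that derives one tab-separated line from a raw message, using
the field order of the NNTP OVER/XOVER response (number, Subject, From, Date,
Message-ID, References, bytes, lines). The function name and the sample
message are made up for this example; an IMAP client could ask for roughly
the same data with something like
FETCH 1:* (UID RFC822.SIZE BODY.PEEK[HEADER.FIELDS (SUBJECT FROM DATE MESSAGE-ID REFERENCES)]),
but whether a given server answers that from a cache is up to the server.

```python
from email import message_from_string

# Header order of an NNTP overview line (OVER in RFC 3977, legacy XOVER).
OVERVIEW_HEADERS = ("Subject", "From", "Date", "Message-ID", "References")


def overview_line(number: int, raw: str) -> str:
    """Build one tab-separated overview record from a raw message (hypothetical helper)."""
    msg = message_from_string(raw)
    fields = [str(number)]
    for name in OVERVIEW_HEADERS:
        # Collapse folded header lines and tabs so each field stays on one line.
        fields.append(" ".join(msg.get(name, "").split()))
    body = raw.partition("\n\n")[2]
    fields.append(str(len(raw.encode("utf-8"))))  # size of the whole message in bytes
    fields.append(str(len(body.splitlines())))    # number of body lines
    return "\t".join(fields)


# Sample message invented for this sketch.
SAMPLE = (
    "From: A Hacker <a@example.org>\n"
    "Subject: Re: IMAP server\n"
    "Date: Mon, 01 Oct 2018 09:46:00 +0200\n"
    "Message-ID: <2@example.org>\n"
    "References: <1@example.org>\n"
    "\n"
    "line one\n"
    "line two\n"
)

print(overview_line(42, SAMPLE))
```

An NNTP server can store lines like this once at import time, which is what
makes XOVER cheap; the sketch recomputes them on every call.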
> I tried to read through the IMAP search specification to see how it
> compares with what public-inbox makes available and I did not get
> particularly far. It was not easy to match up the various search
> capabilities. The biggest issue is that IMAP tends to not talk
> about message-ids. Where that is fundamentally one of the most
> important fields to index if you are dealing with threaded mail.
You can search for arbitrary headers in search by using
HEADER <field-name> <string>
where the string is "contains", so you can use it for both Message-Id
and References headers.
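A minimal Python sketch of such a query, assuming the standard imaplib
module and a connection on which a mailbox has already been selected. The
helper names are invented for this example; only the HEADER and OR search
keys come from the IMAP specification.

```python
import imaplib
from typing import List


def quote(value: str) -> str:
    """Wrap a value as an IMAP quoted string, escaping backslash and double quote."""
    return '"' + value.replace("\\", "\\\\").replace('"', '\\"') + '"'


def header_criterion(field: str, value: str) -> str:
    """Return a SEARCH key of the form: HEADER <field-name> <string>."""
    return f"HEADER {field} {quote(value)}"


def thread_criterion(message_id: str) -> str:
    """Match the message itself or any message whose References mention it."""
    return "OR {} {}".format(
        header_criterion("Message-ID", message_id),
        header_criterion("References", message_id),
    )


def search_thread(conn: imaplib.IMAP4, message_id: str) -> List[bytes]:
    """Run the search on an already-selected mailbox and return message numbers."""
    status, data = conn.search(None, thread_criterion(message_id))
    if status != "OK":
        raise RuntimeError(f"SEARCH failed: {status}")
    return data[0].split()


print(thread_criterion("<2@example.org>"))
# prints: OR HEADER Message-ID "<2@example.org>" HEADER References "<2@example.org>"
```

Because the match is a substring test, the same criterion also catches
messages further down the thread that list the id anywhere in References.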
> So long story short while I am not opposed to a read-only IMAP
> configuration I think NNTP has much to recommend it. I do think we need
> little things like SSL support for NNTP. Just to prevent inappropriate
> access to traffic in flight.
Sure. I'm not saying NNTP is bad, just saying that the choice of clients
is rather limited. Also, posting isn't supported over NNTP, so if I had
it all in my email client I could read in the public-inbox archive, and
respond via normal email.
> It won't be for a while yet but I have some scripts I need to push at
> least to the public-inbox scripts directory that simplify the process
> of taking a single email address, subscribing to email, and sorting it
> out into different public-inbox git archives. Currently I have every
> mailing list I am subscribed to pushed into public-inbox.
:-)
johannes

* Re: IMAP server [was: Q: V2 format]
  2018-10-01  7:46 ` Johannes Berg
@ 2018-10-01  8:51 ` ebiederm
  0 siblings, 0 replies; 21+ messages in thread
From: ebiederm @ 2018-10-01 8:51 UTC (permalink / raw)
To: Johannes Berg; +Cc: Eric Wong, meta
Johannes Berg <johannes@sipsolutions.net> writes:
> On Fri, 2018-09-28 at 23:01 +0200, Eric W. Biederman wrote:
>>
>> I have looked at gnus and there is support in there for performing
>> searches via the old gmane web interface. Public inbox already provides
>> an attribute that tells you what the web server is. So all it will
>> really take is someone with a little time to wire up the search
>> interface.
>
> That's ... interesting, but of course completely out-of-band. I'm not
> sure it should or could be advocated that every email client actually
> implement that :-)
>
> But if you think broader than that, you don't even necessarily need a
> web server to run p-i.
>> Beyond that if you have the archives local (and that is easy) it is
>> quite possible to just git grep through them and find things of
>> interest.
>
> That also doesn't use the index, not sure how that's any better?
So for linux-kernel, I have 7G for the git email archive and 65G more
for the indexes, which makes the indexes quite expensive. So for
personal use I am not certain an index is a benefit, especially when
the email archive fits in RAM and the index does not.
I have to wonder if there is a way to make the indexes an order of
magnitude smaller.
>> I should verify this but I don't think IMAP has a good version of the
>> NNTP overview database. Which seems to make IMAP quite a bit slower for
>> reading news. Certainly gnus+public-inbox locally is running quite a
>> bit faster than my old gnus+cyrus-imap configuration.
>
> IMAP servers typically should do header/MIME parsing, so you should be
> able to query such a thing - but not as easily as XOVER, I suppose.
>
> However, I think FETCH could be made to return the data similar to
> XOVER, though it may not be backed by a pre-created database file, and
> it depends on what the client does to show the overview in the first
> place.
>> I tried to read through the IMAP search specification to see how it
>> compares with what public-inbox makes available and I did not get
>> particularly far. It was not easy to match up the various search
>> capabilities. The biggest issue is that IMAP tends to not talk
>> about message-ids. Where that is fundamentally one of the most
>> important fields to index if you are dealing with threaded mail.
>
> You can search for arbitrary headers in search by using
>
> HEADER <field-name> <string>
>
> where the string is "contains", so you can use it for both Message-Id
> and References headers.
>> So long story short while I am not opposed to a read-only IMAP
>> configuration I think NNTP has much to recommend it. I do think we need
>> little things like SSL support for NNTP. Just to prevent inappropriate
>> access to traffic in flight.
>
> Sure. I'm not saying NNTP is bad, just saying that the choice of clients
> is rather limited. Also, posting isn't supported over NNTP, so if I had
> it all in my email client I could read in the public-inbox archive, and
> respond via normal email.
The thing I can confirm, and as far as I have gotten, is that NNTP has a
sequential message number, IMAP has a sequential message number, and
public-inbox has a sequential message number (now reliably based upon
the order of the messages in the git archive). So it is very possible to
have a read-only IMAP view.
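To make the numbering idea concrete, here is a small Python sketch that
assigns numbers to Message-IDs in the order they are replayed, so the same
number could serve as an NNTP article number and as an IMAP UID. The class
and function names are hypothetical and this is not how public-inbox stores
its map; deletions, which would leave gaps in a real archive, are ignored.

```python
from typing import Dict, Iterable, List, Optional


class NumberMap:
    """Assign stable sequential numbers to Message-IDs in arrival order."""

    def __init__(self) -> None:
        self._by_mid: Dict[str, int] = {}
        self._by_num: List[str] = []

    def add(self, message_id: str) -> int:
        """Number a message once; adding it again returns the existing number."""
        if message_id in self._by_mid:
            return self._by_mid[message_id]
        self._by_num.append(message_id)
        number = len(self._by_num)  # numbering starts at 1
        self._by_mid[message_id] = number
        return number

    def number_for(self, message_id: str) -> Optional[int]:
        return self._by_mid.get(message_id)

    def message_id_for(self, number: int) -> Optional[str]:
        if 1 <= number <= len(self._by_num):
            return self._by_num[number - 1]
        return None


def build_map(history: Iterable[str]) -> NumberMap:
    """Replay Message-IDs oldest first, for example in git commit order."""
    numbers = NumberMap()
    for mid in history:
        numbers.add(mid)
    return numbers


numbers = build_map(["<1@example.org>", "<2@example.org>", "<3@example.org>"])
print(numbers.number_for("<2@example.org>"))  # prints 2
print(numbers.message_id_for(3))              # prints <3@example.org>
```

Replaying the same history always yields the same numbers, which is the
property that lets the map be rebuilt from the git archive alone.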
The really noticeable downside of IMAP is that it does want to keep the
status of messages you have read on the server. That makes a read-only
archive a bit of a pain.
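One way a read-only server might deal with that, sketched in Python: answer
SELECT with the READ-ONLY response code from the IMAP specification and
refuse any command that would change flags or mailboxes. The command list
and reply wording are assumptions made for this example, not the behaviour
of any existing server, and everything else is simply acknowledged.

```python
from typing import Set

# Commands that would modify a mailbox or its flags (not an exhaustive list).
WRITE_COMMANDS: Set[str] = {
    "APPEND", "COPY", "CREATE", "DELETE", "EXPUNGE", "RENAME", "STORE",
}


def respond(tag: str, command: str) -> str:
    """Return the tagged reply a read-only server might send for one command."""
    words = command.split()
    if not words:
        return f"{tag} BAD empty command"
    name = words[0].upper()
    if name == "UID" and len(words) > 1:
        name = words[1].upper()  # UID STORE, UID COPY, ...
    if name in WRITE_COMMANDS:
        return f"{tag} NO {name} refused: archive is read-only"
    if name in ("SELECT", "EXAMINE"):
        return f"{tag} OK [READ-ONLY] {name} completed"
    return f"{tag} OK {name} completed"  # simplification: no real command handling


print(respond("a1", "SELECT INBOX"))
# prints: a1 OK [READ-ONLY] SELECT completed
print(respond("a2", r"UID STORE 7 +FLAGS (\Seen)"))
# prints: a2 NO STORE refused: archive is read-only
```

With this approach the read/unread state has to live in the client, much as
a news reader keeps it locally for NNTP.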
So I am not certain the choice of clients, when you restrict IMAP to
what a read-only archive allows, is an advantage. Nor am I certain the
general IMAP search functionality maps well to what public-inbox
indexes or people want to search for.
Which is me again saying that while things can be made to work I am not
certain IMAP is the best protocol for the job.
>> It won't be for a while yet but I have some scripts I need to push at
>> least to the public-inbox scripts directory that simplify the process
>> of taking a single email address, subscribing to email, and sorting it
>> out into different public-inbox git archives. Currently I have every
>> mailing list I am subscribed to pushed into public-inbox.
>
> :-)
I do love that public-inbox makes it very easy to archive all my content
and still be able to take it all with me when I travel.
Eric