What is it about Windows that makes you call it a distant
possibility? Is it just that you are unfamiliar with it or is
there some specific OS level feature you plan on needing?

This is mostly because I wanted to abuse lazy commit of POSIX.
Now that I think of it Windows is mostly ok, except for the fork
trick used in concurrent GC. As Vladimir pointed out on Windows
there are other ways to do it but they are more involved.
---
Dmitry Olshansky

BTW, Rainer Schuetze has studied this in detail and has written
down some of it here:
http://rainers.github.io/visuald/druntime/concurrentgc.html

My take on D's GC problem, also spoiler - I'm going to build a new one
soonish.
http://olshansky.me/gc/runtime/dlang/2017/06/14/inside-d-gc.html
---
Dmitry Olshansky

Very informative, thanks.
However, I can think of many reasons, like appreciation of the efforts of
the original authors, to tone it down a little bit, like changing
"mistake" to "optimization opportunity", "criticism" to "observation",
etc. :)
Ali

I could call it a problem :) Still one reason I didn't go to D
blog to post this is because it's a critique followed by a
promise of action though.

Very interesting indeed!
One question about killing the no interior pointer attribute: would this
be problematic for 32-bit platforms? And if so, what do you plan to do
about it? Keep the current GC as version(32bit) and your new version as
version(64bit)?
One (potentially crazy) idea that occurred to me while reading your post
is TLS allocations. I haven't thought through the details of how this
would interact with the existing language yet, but would it make sense
for some allocations that you know will never be shared across threads
to be allocated in a thread-local pool instead of the global pool? I.e.,
in addition to the global set of memory pools you also have thread-local
memory pools. Then you could potentially run collections per-thread
rather than stop-the-world.
For example, if you have a bunch of threads that call a function that
does a bunch of short-lived allocations that are not shared across
threads, it seems too wasteful to have these allocations add to the
global GC load. Why not have them go into a local pool that can be
collected per-thread? Of course, whether the current language can take
advantage of this is another matter. Perhaps if the function is pure
and returns scope, then you know any allocation it makes can't possibly
be shared with other threads, or something like that...
On Mon, Jun 19, 2017 at 10:50:05PM +0000, Adam D. Ruppe via Digitalmars-d wrote:

What is it about Windows that makes you call it a distant possibility?
Is it just that you are unfamiliar with it or is there some specific
OS level feature you plan on needing?

He mentioned the "fork trick", which I assume refers to how Linux's
implementation of fork() uses copy-on-write rather than immediately
duplicating the parent process' memory structures. There was a D1 GC
some time ago that depended on this behaviour to speed up the collection
cycle. AFAIK, Windows does not have equivalent functionality to this.
(Well, for that matter, I'm not sure Posix in general has this feature
either, since AFAIK it's Linux-specific. But I surmise that modern-day
*nix flavors probably have adopted this in one way or another, since
otherwise the very common pattern of fork-and-exec would be inordinately
expensive -- copying all the parent's pages only to replace them all
pretty much immediately.)
T
--
Give me some fresh salted fish, please.
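
To make the "fork trick" above concrete, here is a minimal C sketch of the property a forking collector relies on: after fork(), the child keeps seeing memory as it was at the moment of the fork, even while the parent goes on mutating it. Copy-on-write is how kernels make that snapshot cheap, not something the code can observe directly. The names `heap_word` and `value_seen_by_child_snapshot` are made up for this illustration and do not come from any D GC.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Stand-in for a word of GC heap that the mutator keeps changing. */
volatile int heap_word = 1;

/* Forks, lets the parent overwrite heap_word, then asks the child what it
   sees. Returns the child's view, or -1 on error. */
int value_seen_by_child_snapshot(void)
{
    int to_child[2], to_parent[2];
    if (pipe(to_child) != 0 || pipe(to_parent) != 0)
        return -1;

    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* Child ("collector"): wait until the parent has written, then
           report the value visible in our own copy of the address space. */
        char go;
        int seen;
        close(to_child[1]);
        close(to_parent[0]);
        if (read(to_child[0], &go, 1) != 1)
            _exit(2);
        seen = heap_word;
        if (write(to_parent[1], &seen, sizeof seen) != (ssize_t)sizeof seen)
            _exit(3);
        _exit(0);
    }

    /* Parent ("mutator"): keeps running and modifies the heap. */
    close(to_child[0]);
    close(to_parent[1]);
    heap_word = 2;

    char go = 'g';
    int seen = -1;
    if (write(to_child[1], &go, 1) == 1) {
        if (read(to_parent[0], &seen, sizeof seen) != (ssize_t)sizeof seen)
            seen = -1;
    }
    close(to_child[1]);
    close(to_parent[0]);

    int status;
    waitpid(pid, &status, 0);
    return seen;
}
```

Calling `value_seen_by_child_snapshot()` returns 1 (the pre-fork value) while the parent's own `heap_word` is 2. A forking GC would mark from such a frozen view in the child and send the results back to the parent.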

it highly depends on undocumented windows internals, and is not portable
between windows versions. more-or-less working implementations of `fork()`
have existed at least since the NT3 era, but nobody considered 'em as more than
a PoC, and even the next service pack can break everything.

I'm wondering what Windows 10 is using to implement "fork" for Windows
Subsystem for Linux. If it's using these internal functions or something
else.
--
/Jacob Carlborg

It wouldn't surprise me to learn that it was a POSIX-layer-specific
syscall, meaning we can't use it from a native Windows process.

The Windows Subsystem for Linux is built on a new form of processes
called picoprocesses. There's a whole API built specifically to service
WSL that's not otherwise available (AFAIR) to normal processes, for
security reasons.
I highly recommend watching this talk:
https://www.youtube.com/watch?v=36Ykla27FIo and browsing through
this repo: https://github.com/ionescu007/lxss which reveals many
interesting details about that part of Windows.
I watched that talk a while ago and maybe I have
misremembered something, but my understanding is that using the
WSL infrastructure is off limits for normal Win32 processes and
as such is not suitable for implementation of CoW pages for D's
GC.
(I watched that talk specifically because I was interested if
some of that could be used in druntime.)

the forking D1 GC was even ported to D2, and worked. sadly, using `fork()` has its
own set of problems -- `fork()` itself is in no way a flawless experience.
for example, you can fork while another thread is inside glibc's `malloc()`, and
BOOM! a lot of glibc is locked forever, as the `malloc()` lock is never released
in the child process. some other libraries may try to intercept `fork()` to
do unnecessary "cleanup", and so on.
so using a "forking GC" requires a lot of discipline in coding and library use,
or it will be an endless source of heisenbugs.
new linux kernels got the userfaultfd API (so code can simply `select()` on an fd,
and process protection violations from `mprotect()` without tricks with
signals), but... to much of my joy and happiness, the proposed API was just
fine to create a GC with mprotect barriers, and the final API that was
included gladly omitted exactly the necessary API call which allows to make
it happen. great work, yeah. it may have changed since then, tho, i didn't
recheck.

Since we are in control of what the child does, I see this as no
issue. Just call mmap and do bump-pointer allocation.
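
As a rough illustration of that suggestion, a forked child can avoid `malloc()` entirely by taking memory straight from `mmap` and handing it out with a bump pointer. The sketch below is one possible shape for this, not code from any existing D runtime. The type `BumpArena`, the three functions, and the 16-byte alignment are all assumptions made for the example.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/* Hypothetical arena: one anonymous mapping plus a running offset. */
typedef struct {
    unsigned char *base;
    size_t used;
    size_t capacity;
} BumpArena;

/* Maps `capacity` bytes of private anonymous memory; no libc heap involved. */
int bump_init(BumpArena *a, size_t capacity)
{
    assert(a != NULL);
    void *p = mmap(NULL, capacity, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;
    a->base = p;
    a->used = 0;
    a->capacity = capacity;
    return 0;
}

/* Returns 16-byte aligned memory, or NULL when the arena is exhausted. */
void *bump_alloc(BumpArena *a, size_t size)
{
    size_t aligned = (size + 15) & ~(size_t)15;
    if (aligned < size || aligned > a->capacity - a->used)
        return NULL;
    void *p = a->base + a->used;
    a->used += aligned;
    return p;
}

/* Nothing is freed individually; the whole mapping goes away at once. */
void bump_release(BumpArena *a)
{
    munmap(a->base, a->capacity);
    a->base = NULL;
    a->used = 0;
    a->capacity = 0;
}
```

Because the child never touches the `malloc()` lock, it does not matter whether another thread held that lock at the moment of the fork. This only covers the child's own allocations; it does nothing about third-party fork handlers.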

Yeah, if said 32-bit application makes use of the no interior pointer
attribute, then using the old GC is an option. I have no plans for
this broken attribute.

Thread-local pools would need a spec update on the interaction between
TLS and shared; in particular, the current trend of lock + cast away
shared is problematic. So is the implicit cast to immutable of the
result of a unique expression.

To the best of my knowledge, all of D's current target OSes
support copy-on-write fork() save for Windows.

Looks like I'm not the only one itching to have a go at D's GC :)
This will very likely be my DConf 2018 project. However, I have
slightly different plans:
- The GC should be usable as a library (mainly to facilitate
testing).
- Support for all platforms D already supports from the start.
- Use design-by-introspection when applicable and
design-by-contract elsewhere to split the design into modular
components.
- Make the GC configurable (using policies) and swappable at
runtime. (No need to get clever, just treat previous
implementation's pools as opaque void[]).
- Support concurrency on Windows via anonymous memory-mapped
files.
- Support generational collection using write barriers
implemented through memory protection.
- Integrate existing GC work - don't reinvent the wheel.
- More, much more debugging facilities! Integrate Diamond and
Valgrind interoperability.
- Gray-marking and compacting.
- Still need to look at immix.
I have some past work that I'd like to integrate (an experimental
generational GC I wrote like 9 years ago for D1, Diamond, and
Valgrind integration I have in a fork somewhere.)
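
For readers unfamiliar with the "write barriers implemented through memory protection" item in the list above, the following C sketch shows the general technique on POSIX: pages are made read-only, the first write to a page faults, and the fault handler records the page as dirty before re-enabling writes. A generational collector could then treat the dirty pages as its remembered set. Everything here (the `barrier_*` names, the 64-page limit, exiting on unrelated faults) is an assumption for the illustration, and calling `mprotect` from a signal handler is common practice rather than something POSIX guarantees to be safe.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <signal.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define MAX_PAGES 64

static unsigned char *g_region;
static size_t g_page_size, g_page_count;
static volatile sig_atomic_t g_dirty[MAX_PAGES];

/* Fault handler: mark the written page dirty and unprotect it, so the
   faulting store is retried and succeeds when the handler returns. */
static void on_fault(int sig, siginfo_t *info, void *ctx)
{
    (void)sig;
    (void)ctx;
    uintptr_t addr = (uintptr_t)info->si_addr;
    uintptr_t base = (uintptr_t)g_region;
    if (addr < base || addr >= base + g_page_count * g_page_size)
        _exit(1); /* a genuine crash, not our barrier */
    size_t page = (addr - base) / g_page_size;
    g_dirty[page] = 1;
    if (mprotect(g_region + page * g_page_size, g_page_size,
                 PROT_READ | PROT_WRITE) != 0)
        _exit(1);
}

/* Maps `pages` pages, installs the handler and write-protects the region. */
int barrier_arm(size_t pages)
{
    if (pages == 0 || pages > MAX_PAGES)
        return -1;
    g_page_size = (size_t)sysconf(_SC_PAGESIZE);
    g_page_count = pages;
    void *p = mmap(NULL, pages * g_page_size, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;
    g_region = p;
    for (size_t i = 0; i < MAX_PAGES; i++)
        g_dirty[i] = 0;

    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = on_fault;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGSEGV, &sa, NULL) != 0 || sigaction(SIGBUS, &sa, NULL) != 0)
        return -1;
    return mprotect(g_region, pages * g_page_size, PROT_READ);
}

size_t barrier_page_size(void) { return g_page_size; }
int barrier_is_dirty(size_t page) { return page < g_page_count && g_dirty[page]; }

/* Mutator-side accessors; the volatile access keeps the store in place. */
void barrier_store(size_t offset, unsigned char value)
{
    assert(offset < g_page_count * g_page_size);
    ((volatile unsigned char *)g_region)[offset] = value;
}

unsigned char barrier_load(size_t offset)
{
    assert(offset < g_page_count * g_page_size);
    return ((volatile unsigned char *)g_region)[offset];
}
```

After `barrier_arm(4)`, a call such as `barrier_store(2 * barrier_page_size(), 7)` takes one fault and leaves only page 2 marked dirty. The cost of that fault on the first write to every protected page is the overhead that makes this kind of barrier expensive.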

I see no problem in eventually uniting our efforts.

Nice. A pool could have many different structures, the collector
could then introspect on that. Sadly this almost doubles the
effort so I will not go there.

Yeah, I recall Rainer and myself discussing the memory-mapped file
approach for Windows; it had some downsides, such as needing to remap
each pool individually. Still doable.

Write barriers via memory protection are super slow, sadly. That
being said, I believe D is just fine without a generational GC. The
generational hypothesis just doesn't hold to the extent it holds in,
say, Java. My hypothesis is that most performance-minded applications
already allocate temporaries using a region allocator of sorts (or
using the C heap).

FWIW, here's a data point to the contrary:
One of my projects involves constructing a (very large) AA that grows
over time, and entries are never deleted. The AA itself is persistent
and lasts until the end of the program. Besides the AA, there are a
couple of arrays that also grow (more slowly) but eventually become
unreferenced. Because of the sheer size of the AA, I've observed that
GC collection cycles become slower and slower, yet most of this extra
work is completely needless, because the only thing that might need
collecting is the arrays, yet the GC has to mark the entire AA each
time, only to discover it's still live.
After some experimentation I discovered that I could get up to 40-50%
performance improvement just by calling GC.disable and scheduling my own
GC collection cycles via GC.collect at a slower rate than the current
default setting.

From this, it would seem to me that a generational collector would have
helped, since most of the AA will eventually migrate to older
generations and most of the time the GC won't bother marking/scanning
those parts. Of course, this is only for this particular program, and I
can't say that this is typical usage for D programs in general. But I
think D would still benefit from a generational collector.
T
--
What did the alien say to Schubert? "Take me to your lieder."

Interestingly, the moment you "reallocate" to expand the AA it
will be considered a new object. Overall I think your case is
more about faulty collection heuristics, that is, collecting when
there is only a slim chance of getting enough free space after
collection.

This is not entirely true. The *table* itself will of course get moved
to a new object, but most of the size of the AA comes from its entries,
and those are nodes that stay in-place. You'll still have to scan
references to the table, of course, but that's a lot better than
scanning all the entries as well.
T
--
The diminished 7th chord is the most flexible and fear-instilling chord. Use it
often, use it unsparingly, to subdue your listeners into submission!

Good overview, however:
the binary search pool lookup is used because it naturally
supports variable sized pools.
IMHO, simply concluding "A hash table could have saved quite a
few cycles." glosses over the issue of handling variable sizes.
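
To show why binary search copes with variable-sized pools so naturally, here is a small C sketch of that kind of lookup. It assumes the pools are kept sorted by base address and do not overlap. The `Pool` struct and `find_pool` are invented for this example and are not druntime's actual data structures.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical pool descriptor covering the half-open range [base, top). */
typedef struct {
    uintptr_t base;
    uintptr_t top;
} Pool;

/* Returns the index of the pool containing addr, or -1 if none does.
   Pools may have any size; only their ordering matters. */
ptrdiff_t find_pool(const Pool *pools, size_t count, uintptr_t addr)
{
    assert(pools != NULL || count == 0);
    size_t lo = 0, hi = count;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (addr < pools[mid].base)
            hi = mid;              /* addr lies below this pool */
        else if (addr >= pools[mid].top)
            lo = mid + 1;          /* addr lies above this pool */
        else
            return (ptrdiff_t)mid; /* base <= addr < top */
    }
    return -1;
}
```

The lookup costs O(log n) comparisons in the number of pools and needs no alignment or size constraints on them, which is the property a hash-based replacement would have to give up or work around.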

Pools are granular to 256 KB IIRC, so the trick is to keep them
256 KB aligned in memory. Then a map from 256 KB chunks to pools is
easily created.
---
Dmitry Olshansky
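
One way to picture the chunk map described above is a flat table indexed by address divided by 256 KB, where a pool spanning several chunks simply occupies several consecutive entries. The C sketch below is only an illustration under stated assumptions: every pool is 256 KB aligned and a multiple of 256 KB in size, all pools fall inside a 4 GiB window starting at `window_base`, and the names `ChunkMap` and `chunkmap_*` are invented here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CHUNK_SHIFT 18                          /* 256 KB == 1 << 18 */
#define CHUNK_SIZE  ((uintptr_t)1 << CHUNK_SHIFT)
#define CHUNK_COUNT 16384                       /* 16384 * 256 KB == 4 GiB */
#define NO_POOL     (-1)

typedef struct {
    uintptr_t window_base;                      /* must be chunk aligned */
    int16_t pool_of_chunk[CHUNK_COUNT];         /* pool id per chunk */
} ChunkMap;

void chunkmap_init(ChunkMap *m, uintptr_t window_base)
{
    assert(window_base % CHUNK_SIZE == 0);
    m->window_base = window_base;
    for (size_t i = 0; i < CHUNK_COUNT; i++)
        m->pool_of_chunk[i] = NO_POOL;
}

/* Registers a pool of any chunk-multiple size; returns -1 if it is
   misaligned or does not fit in the window. */
int chunkmap_add_pool(ChunkMap *m, int16_t pool_id, uintptr_t base, size_t size)
{
    if (size == 0 || base % CHUNK_SIZE != 0 || size % CHUNK_SIZE != 0)
        return -1;
    if (base < m->window_base)
        return -1;
    uintptr_t first = (base - m->window_base) >> CHUNK_SHIFT;
    uintptr_t n = size >> CHUNK_SHIFT;
    if (first >= CHUNK_COUNT || n > CHUNK_COUNT - first)
        return -1;
    for (uintptr_t i = 0; i < n; i++)
        m->pool_of_chunk[first + i] = pool_id;
    return 0;
}

/* Constant-time lookup: one subtraction, one shift, one table read. */
int chunkmap_find(const ChunkMap *m, uintptr_t addr)
{
    if (addr < m->window_base)
        return NO_POOL;
    uintptr_t idx = (addr - m->window_base) >> CHUNK_SHIFT;
    return idx < CHUNK_COUNT ? m->pool_of_chunk[idx] : NO_POOL;
}
```

With 2-byte entries the table is 32 KB for a 4 GiB window, so variable pool sizes are handled at the price of one entry per 256 KB chunk. On a 64-bit address space a single flat window like this would not be enough, which is where a larger granularity or a multi-level table would come in.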

"...the dubious optimization of no interior pointers..."
this is the ONLY (i emphasise it!) way i was able to make my e-mail and
irc clients not leak memory, and keep using GC. on 32-bit systems false
pointers *are* a problem, and NO_INTERIOR really helps.
turning NO_INTERIOR into something dog-slow (or a noop) will make D unusable
on 32-bit systems for anything more complex than helloworld and throwaway
scripts. particularly, any app that should work for weeks or months
without restart (yep, i want my mail client to Just Work, and i'm not
rebooting my PC that often) will be *forced* to ditch GC.
while NO_INTERIOR requires some coding discipline, it is invaluable in IRL apps.

You need to move to 64bit. Apple is already deprecating support for
32bit apps and after the next version of macOS (High Sierra) they're
going to remove the support for 32bit apps.
--
/Jacob Carlborg

Many thanks for your efforts Dmitry :)
May I ask you if you plan to make a soft real-time GC similar to
the one implemented in the Nim language ?
https://nim-lang.org/docs/gc.html
https://nim-lang.org/docs/intern.html#debugging-nim-s-memory-management
What is great about it is that we can call it regularly to
collect memory a bit at a time, giving it a maximum delay for
this operation.
Being able to manually specify the maximum GC delay is what makes
Nim compatible with game development, as collections can be made
iteratively, and on a per-thread basis.
In the worst case, we know that just one of the application
threads will be delayed for a few milliseconds between two frame
renderings, which is generally acceptable for games and other
similar applications.
Moreover, this opens the opportunity to call the GC only in the
main menu or the pause menu for instance, but not during actual
gameplay, so that even these few lost milliseconds will always
remain unnoticed.
This is probably why Nim's author was once paid to wrap an open
source game engine (Urho3D), and improve the language's native
compatibility with C++ libraries.
https://forum.nim-lang.org/t/870

FYI, we've tried to improve the binary pool search, but there
aren't many pools and it's quite hard to beat.
A hashtable for pages in the address range is too big.
I'd like to replace all of those separate pools types with a
single page heap, similar to what TCMalloc is using.
http://goog-perftools.sourceforge.net/doc/tcmalloc.html
http://jamesgolick.com/2013/5/19/how-tcmalloc-works.html
There was also https://github.com/dlang/druntime/pull/801 which
got reverted.
One problem that you'll run into with a Thread cache is
synchronizing GC attributes.
In the stalled work on a thread cache for the current GC, using
single-reader single-writer queues would've been an option
to reduce contention.
https://github.com/MartinNowak
https://github.com/dlang/druntime/compare/master...MartinNowak:gcCache#commitcomment-16202536

The hash table doesn't have to be keyed by page. Pool granularity is
256k, so aligning the pools at this boundary is enough. On x64 pool
granularity could be enlarged.

Right now, the separate pool types lead to some inflation of RSS,
because previously used and now freed pages can only be reused when
the whole pool (e.g. 4MB or 16MB) is free again.
It doesn't seem sensible to reserve 16MB only for big (>PAGESIZE)
allocations. In particular once the pages are dirty and mapped,
you'd rather want to make use of them.