Nice newsreader! Fast and does what it needs to do, and written in D. I
like that. :)
I'm currently writing an NNTP web frontend (reading and posting) for my
university. However it's written in PHP so it's not really fitting for a
new D homepage. But I'm curious how you do web programming with D. Do
you use CGI? How do you do all the HTTP stuff (parsing form data, etc.)
and templating?
But back to the NNTP reader:
# HTML formatting
The work you put into formatting messages as HTML is impressive. The
autodetection of source code could really come in handy. I found
[Markdown][1] to work relatively well with typical mails, so the syntax
might contain a few good ideas for e.g. quotes, links, lists, etc.
[1]: http://daringfireball.net/projects/markdown/syntax
# Topic list
Right now you display the newest few messages on the newsgroup. Building
a topic list gets quite a bit more complex. To get a proper topic list
with pagination etc. I query the overview information of all (!)
messages in a newsgroup with the "over" command (the digitalmars.com
server supports the older "xover" which is the same). This contains the
message ID and the references header which can be used to build a
message tree. All messages on the root level of the tree are topics and
it's easy to get the number of replies and the latest reply. It's a bit
tricky sometimes but all other algorithms I came up with tend to lose
some messages (e.g. if the topic post is deleted) or were even slower.
The overview also contains the subject and from header and some other
useful stuff. I suppose the current newsreader does something similar
without caching and this might be the reason why it is so slow.
This message tree and the overview information however can be cached
very easily. The tree can also be extended on the fly, e.g. check for
new messages with the newnews command and add them to the tree. This
might require some locking but at least in PHP flock() was sufficient
for that.
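A minimal sketch of that tree-building step, written in D since that's the target here. The Overview and Topic structs and the buildTopics name are made up for illustration. Fetching and parsing the actual "over" response is left out, and dates are assumed to be normalized to sortable ISO strings already.

```d
/// One row of overview data (hypothetical struct for this sketch).
struct Overview
{
    string messageId;
    string[] references; // ancestors, oldest first, as in the References header
    string date;         // ISO date string, so plain comparison sorts it
}

/// Summary of one topic: the root message plus reply statistics.
struct Topic
{
    string rootId;
    size_t replies;
    string latest; // date of the newest message in the topic
}

/// Groups messages into topics. Each message is linked to its nearest
/// ancestor that still exists, so a reply whose topic post was deleted
/// becomes a root itself instead of getting lost.
Topic[] buildTopics(Overview[] messages)
{
    bool[string] known;
    foreach (m; messages)
        known[m.messageId] = true;

    string[string] parent; // message ID -> nearest known ancestor
    foreach (m; messages)
        foreach_reverse (r; m.references)
            if (r in known && r != m.messageId)
            {
                parent[m.messageId] = r;
                break;
            }

    // Follows parent links upwards; the step bound guards against cycles.
    string rootOf(string id)
    {
        foreach (step; 0 .. messages.length)
        {
            auto p = id in parent;
            if (p is null)
                break;
            id = *p;
        }
        return id;
    }

    Topic[] topics;
    size_t[string] slot; // root ID -> index into topics
    foreach (m; messages)
    {
        auto root = rootOf(m.messageId);
        if (root !in slot)
        {
            slot[root] = topics.length;
            topics ~= Topic(root, 0, "");
        }
        auto t = &topics[slot[root]];
        if (m.messageId != root)
            t.replies++;
        if (m.date > t.latest)
            t.latest = m.date;
    }
    return topics;
}

unittest
{
    auto topics = buildTopics([
        Overview("<a>", [], "2011-01-30"),
        Overview("<b>", ["<a>"], "2011-01-31"),
        Overview("<e>", ["<gone>"], "2011-02-02"), // topic post was deleted
    ]);
    assert(topics.length == 2);
    assert(topics[0].replies == 1);
}
```

Since the result is plain data, it can be serialized as the cache and extended later by running the same grouping over newly fetched overview rows.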
# Cache invalidation
The problem with the message tree cache or cached messages in general is
the invalidation. Looks like the digitalmars news server does not delete
that many messages so this might not be much of a problem. How do you
handle this right now?
# D website
I took a look at your current version of the D website
(http://arsdnet.net/d-web-site/). I really like the layout. Looks good
to get started with D. Just two small things:
- The compile and run button is a bit of a security risk. I was able to
read the /etc/passwd file for example. Maybe it's possible to lock down
the compiled binaries with SELinux. Denial of service attacks (e.g.
endless loops) might still be a problem though. We built an "online D
compiler" for a presentation at our university but didn't publish it
because of these concerns.
- If you only display mails in the announcements which do not have a
"References" header you will only get mails that started a new topic.
This will filter out replies.
If you want some help I could do some stuff. I'm a bit short on time
right now but since I'm building an NNTP reader in PHP anyway I might be
able to help out with your D NNTP reader. I can also help with HTML and
CSS stuff if you want: support for older browsers and older IE versions
(if there is much traffic from these browsers) or some minor design stuff
(I'm not that much of a designer though). I might also start to look
into SELinux…
Happy programming
Stephan Soller
On 31.01.2011 04:08, Adam Ruppe wrote:
> In the other newsgroup, I've been talking about a little
> web news program I've been writing as a spinoff of the
> potential new homepage idea.
>
> It's to the point where it is usable, but still kinda buggy:
>
> http://arsdnet.net/d-web-site/nntp/thread-index?newsgroup=digitalmars.D
>
> Source code: http://arsdnet.net/d-web-site/nntp.d
>
> NOTE: it does /not/ automatically check for new posts. I have
> to manually trigger that right now (I don't want it annoying
> the news server automatically while still in the testing phase.)
>
> It will lazily load a message on demand though if you know
> its message ID:
> http://arsdnet.net/d-web-site/nntp/get-message
>
> Get it from the Message-ID header in the post.
>
>
>
> Anyway, here's the features:
>
> a) It isn't god awful slow. The PHP web news currently on digital
> mars, as best as I can tell, actually polls the news server every
> time you go to its index! This does aggressive local caching.
>
> b) It actually lets you select text...
>
> OK, if I list every annoyance with the current web news, I'll
> never stop. Moving on to new things:
>
> c) It tries to convert news posts to HTML, so the paragraphs
> wrap to the browser, links work, quotes are put into the proper
> tags for indentation, and it tries to auto-detect D code and
> put it in a <pre> block - which my javascript can make inline
> editable and runnable. Example:
>
> http://arsdnet.net/d-web-site/nntp/get-message?newsgroup=digitalmars.D&messageId=%3Cmailman.1085.1296409409.4748.digitalmars-d%40puremagic.com%3E
>
> With script disabled, you'll see the code in a different colored
> block. With script enabled, you'll see an Edit button there
> too.
>
> d) It tries to convert HTML emails back to plain text. (Ironically,
> so it can turn it back to html...) This gives uniformity across
> the various mime types. Similarly, if the type is
> multipart/alternative, it will only show the text version.
>
> e) It also makes an attempt to preserve deliberate whitespace,
> for things like ASCII art or purposefully short lines. If it
> can't make heads or tails of it, it bails out and shows the
> original message in a <pre> block for human consumption.
>
> f) Tries to be fast and lean.
>
> g) Written in D!
>
> h) Already-read messages are tracked by your browser - if the link
> is visited, it puts up a different color url.
>
> Coming as I find time:
>
> a) References to bugzilla entries should be automatically
> converted to links.
>
> b) Viewing threads by date or by threaded view.
>
> c) Posting with the option of automatic quoting.
>
> d) Syntax highlighting of D code in posts.
>
> e) Maybe, maybe links to documentation of functions referenced,
> if I can find a good way to get them automatically. Integration
> with my dpldocs.info site is the way I'd do it.
>
> f) Any more ideas? I'm reluctant to add too much, but if I like
> an idea - or if you want to write the code :) - I'll be open
> to adding it.
>
>
> Known bugs:
>
> Lots of content types aren't handled right and it ignores
> character encoding.
>
> It doesn't always recognize code. This would be ok, but if it
> sees one line as code but doesn't include one of them, it would
> confuse the reader. Example:
>
> http://arsdnet.net/d-web-site/nntp/get-message?newsgroup=digitalmars.D&messageId=%3Cii4lbj%242bes%241%40digitalmars.com%3E
>
> (Look for "auto str =")
>
> The reason for this is it detects code lines by looking for
> semicolons and open braces. It will call something a generic
> <pre> if there's a lot of whitespace in it - figuring it is
> probably ascii art (if it thinks the whitespace has human
> significance, it tries to preserve it), but it still isn't
> a perfect detection function.
>
> I'm open to ideas. We want to detect code, but not flag
> regular English text.
>
>
>
> I'm also open to graphical styling ideas. I put up a dark
> theme here because the white was hurting my eyes, but I change
> on if I like light or dark almost at random. (Depends on the room's
> lighting conditions I think). But I didn't do any more graphic
> setup other than the max-width.
>
> Multiple color schemes is an idea I like.
>
>
>
> BTW, as a fun fact, this post is about 1/4th the size of the
> entire nntp.d code file!

> > Strange thing is, most functions are properly demangled but 2
> > aren't.
> > Is this a (known) bug?
>
> Yes, core.demangle can't do some symbols because DMD applies
> a one-way hash to them once they reach a certain length because
> such long symbols tend to break linkers.
Ah I see, but what about the short one:
_D4arsd3web3runFC4arsd3cgi3CgiPS4arsd3web14ReflectionInfoZv

> But I'm curious how you do web programming with D. Do you use CGI?
Yes, for most of my apps (some have a homegrown HTTP server they use
instead, if persistence is necessary).
The module is here:
http://arsdnet.net/dcode/cgi.d
That same module works with standard CGI and with the embedded
http server, just with different constructors. The default one
reads CGI variables, and the alternative takes http header
and body fed to it from the network class.
> How do you do all the HTTP stuff (parsing form data, etc.)
You can see in the code that it's pretty straightforward. With
the CGI standard, the webserver passes you data through stdin
and environment variables.
For GET and COOKIE variables, you check the relevant environment
variable (QUERY_STRING and HTTP_COOKIE, respectively), then url
decode them and use the resulting string arrays.
For POST, you first check the CONTENT_TYPE and CONTENT_LENGTH
environment variables, then pull in data from stdin (same as any
simple program, except you know the length you want too).
The content type can be one of many options. Regular forms
are application/x-www-form-urlencoded and you decode
them identically to the query string.
My class puts them into an associative array, similar to PHP:
immutable string[string] get;
immutable string[string] post;
immutable string[string] cookies;
Names can also be repeated in a web form. PHP does this with a
naming convention: if you put [] after the name in the form, it
loads up a dynamic array in the field. So name="mything[]", repeated,
becomes $_POST["mything"], which is an Array.
I did it differently - there's simply an alternative variable to
access them:
immutable string[][string] getArray; // ditto for post
The names are preserved from the form exactly. This is the lowest
level access: ?key=value is there as getArray["key"][0] == "value".
I don't try to follow PHP's convention.
(As you can see, getArray["key"][0] is always usable. But since
I find this relatively rare, I also offer plain get["key"] as a
shortcut to it.)
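A minimal sketch of that decoding step, using only Phobos. The function names parseQuery, decodeToken and firstValues are made up for this example and are not taken from cgi.d; the input is assumed to be the raw QUERY_STRING (or an urlencoded POST body) already read into a string.

```d
import std.array : replace, split;
import std.string : indexOf;
import std.uri : decodeComponent;

/// Decodes one urlencoded token: '+' means space, %XX is a byte escape.
/// The '+' replacement has to happen first so that "%2B" stays a plus.
string decodeToken(string s)
{
    return decodeComponent(s.replace("+", " "));
}

/// Parses "a=1&b=2&a=3" and keeps every value seen for each name,
/// in the order they appeared.
string[][string] parseQuery(string query)
{
    string[][string] result;
    foreach (pair; query.split("&"))
    {
        if (pair.length == 0)
            continue;
        auto eq = pair.indexOf('=');
        auto name = decodeToken(eq < 0 ? pair : pair[0 .. eq]);
        auto value = eq < 0 ? "" : decodeToken(pair[eq + 1 .. $]);
        result[name] ~= value;
    }
    return result;
}

/// The shortcut view: only the first value of every name.
string[string] firstValues(string[][string] all)
{
    string[string] result;
    foreach (name, values; all)
        result[name] = values[0];
    return result;
}

unittest
{
    auto getArray = parseQuery("key=value&mything%5B%5D=a&mything%5B%5D=b+c");
    assert(getArray["key"][0] == "value");
    assert(getArray["mything[]"] == ["a", "b c"]); // name kept exactly as sent
    assert(firstValues(getArray)["key"] == "value");
}
```

Note that decodeComponent throws on malformed escapes, so real code would want to catch that and reject the request.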
Where PHP uses globals for this, I used class members. So you'd
actually write:
Cgi cgi = new Cgi();
cgi.get["key"];
And so on.
That handles strings, but there's other content types too. The most
common alternative is used for file uploads.
File upload forms have a content type of multipart/form-data, which
is a MIME style encoding, similar to email attachments. The
content type gives a boundary string. You search stdin for the
boundary, then read some part headers, and finally the data, ending
with the boundary string again.
This continues until you hit the closing delimiter, "--" ~ boundary ~ "--".
Field names are no longer given by key=value like in urlencoding. It's
passed as a field header, after the boundary, before the content.
The original filename for file uploads is passed the same way.
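The splitting described above can be sketched roughly like this. It is a simplified illustration, not the parser from cgi.d: Part, splitMultipart and headerParam are hypothetical names, the whole body is assumed to be in memory already, and a string is used for the content only to keep the example short (real upload data should be treated as bytes).

```d
import std.string : indexOf;

/// One decoded part (hypothetical struct for this sketch).
struct Part
{
    string headers; // raw part headers, CRLF separated
    string content; // part data, without the trailing CRLF
}

/// Splits a complete multipart/form-data body on its boundary.
Part[] splitMultipart(string payload, string boundary)
{
    string delimiter = "--" ~ boundary;
    Part[] parts;

    auto first = payload.indexOf(delimiter);
    if (first < 0)
        return parts;
    payload = payload[first .. $];

    for (;;)
    {
        // payload always starts with a delimiter at this point.
        auto rest = payload[delimiter.length .. $];
        if (rest.length >= 2 && rest[0 .. 2] == "--")
            break; // closing delimiter reached
        if (rest.length >= 2 && rest[0 .. 2] == "\r\n")
            rest = rest[2 .. $];

        auto headerEnd = rest.indexOf("\r\n\r\n");
        auto next = rest.indexOf("\r\n" ~ delimiter);
        if (headerEnd < 0 || next < headerEnd + 4)
            break; // truncated or malformed input
        parts ~= Part(rest[0 .. headerEnd], rest[headerEnd + 4 .. next]);
        payload = rest[next + 2 .. $];
    }
    return parts;
}

/// Pulls a quoted parameter such as name="..." or filename="..." out of
/// the part headers. Assumes the parameter is preceded by a space.
string headerParam(string headers, string param)
{
    string marker = " " ~ param ~ "=\"";
    auto start = headers.indexOf(marker);
    if (start < 0)
        return null;
    auto valueStart = start + marker.length;
    auto len = headers[valueStart .. $].indexOf('"');
    return len < 0 ? null : headers[valueStart .. valueStart + len];
}

unittest
{
    string payload = "--XyZ\r\n"
        ~ "Content-Disposition: form-data; name=\"title\"\r\n\r\n"
        ~ "hello\r\n"
        ~ "--XyZ--\r\n";
    auto parts = splitMultipart(payload, "XyZ");
    assert(parts.length == 1);
    assert(headerParam(parts[0].headers, "name") == "title");
    assert(parts[0].content == "hello");
}
```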
The CGI class takes care of all this for you, loading up the same
associative arrays you get with a normal form. If there are files
uploaded though, you access them through:
cgi.files["name_from_form"]
Which returns an UploadedFile struct. It includes the metadata passed
along and the file's contents as a byte array. (You can expect this
wouldn't work for very large files. That's probably why PHP uses
a temporary file, but I find that such a hassle that I wanted to
avoid it. My class currently simply rejects too big files, since
I've not needed to solve that problem yet! All my apps only accept
small files to upload anyway, little spreadsheet attachments, photos,
etc., all of which easily fit in memory.)
Anyway, saving a file is as simple as:
std.file.write("some name", cgi.files["myfile"].content);
You can also use the member strings filename and contentType of
that UploadedFile struct to get more info.
Writing response data back to the user's browser is a simple case
of writing things to stdout. First comes headers, then data. I
abstracted this with the class too:
cgi.write(); // writes response data, like php's echo
For headers, there's some specific functions to do it, or a generic
header() method that works just like PHP's.
cgi.setResponseLocation("/"); // does a 302 redirect
cgi.setResponseContentType("image/png"); // tell the browser a png is coming
cgi.write("hello!"); // write data
See the cgi.d file for details and more. The reason I provide these
instead of just letting the user code use writefln() or whatever
directly is:
a) isolate them from handling the headers. It isn't hard to do,
but it is easy to make mistakes and it's a bit tedious. The class
takes care of it for you every time.
b) writefln() won't work in the embedded server environment.
cgi.write, on the other hand, will. (It implements this via a
delegate passed to the constructor. It passes your data to the
delegate, which is responsible for forwarding it to the network)
Embedded server headers are slightly different than CGI headers too.
The helper functions keep these changes from affecting user code.
Switching from CGI to embedded server, if you use the class, is often
as simple as changing the constructor call, keeping the rest of the
code unchanged.
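The delegate idea can be illustrated with a small stand-in class. This is only a sketch of the pattern, not the Cgi class itself: Response, header and the sink parameter are invented names, and the header handling is reduced to the bare minimum.

```d
/// Hypothetical response writer: all output goes through a delegate,
/// so the same user code can target stdout (CGI) or a socket.
class Response
{
    private void delegate(const(char)[]) sink;
    private string[] headers;
    private bool headersSent;

    this(void delegate(const(char)[]) sink)
    {
        this.sink = sink;
    }

    /// Queues a header line; only valid before the first write.
    void header(string line)
    {
        assert(!headersSent, "headers already sent");
        headers ~= line;
    }

    /// Sends queued headers once, then the data.
    void write(const(char)[] data)
    {
        if (!headersSent)
        {
            foreach (h; headers)
            {
                sink(h);
                sink("\r\n");
            }
            sink("\r\n"); // blank line separates headers from the body
            headersSent = true;
        }
        sink(data);
    }
}

unittest
{
    // Here the "network" is just a buffer; under CGI the delegate would
    // forward to stdout instead, e.g. (const(char)[] d) { stdout.rawWrite(d); }
    char[] output;
    auto response = new Response((const(char)[] d) { output ~= d; });
    response.header("Content-Type: text/plain");
    response.write("hello!");
    assert(output == "Content-Type: text/plain\r\n\r\nhello!");
}
```

User code only ever calls header() and write(), so swapping the delegate is the whole difference between the two environments in this sketch.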
In theory, FastCGI or other protocols could be added through
additional constructors too. I haven't done this myself though
because plain old CGI is both well supported and quite fast, despite
its reputation. (I think CGI got the blame for Perl's slowness
more than its own weaknesses. Yes, its startup and some parts of
the output are slower than something like mod_php, but the program
itself still runs as fast as it runs. For D, that means it blows
PHP's speed out of the water.
Startup time tends to have a disproportionate impact on benchmarks,
because those benchmarks don't actually do anything interesting! Once
your program does something useful, the time it spends doing real
work will quickly outgrow the startup time, so that slight initial
delay becomes irrelevant in the overall result.)
> and templating?
I have two methods that I use together: a TemplatedDocument class,
and a plain old (well, extended) DOM style Document class.
TemplatedDocument extends Document, so everything about the latter
applies to the former.
You start with a well-formed HTML file. This might be built out
of the text of several files, e.g.:
auto document = new TemplatedDocument(
std.file.readText("header.html" ) ~
std.file.readText("mypage.html" ) ~
std.file.readText("footer.html" ));
Now you have a DOM object that you can grow or modify with your
content. Building a tree with the standard DOM is tedious, but
I have some extensions to help with that.
document.getElementById("some-holder").innerText = my_name;
The innerText method, borrowed from Internet Explorer, is one
of the most useful. You can get or set plain text, with the object
taking care of encoding.
Alternatively, the HTML file might have a placeholder:
<h1>{$title}</h1>
And you fill those into the document via a simple AA:
document.vars["title"] = "My cool site & stuff";
Document vars are automatically encoded for HTML before being output.
The only way to put raw html in the output is to use the innerHTML
DOM method and friends or to use the innerRawSource extension.
(innerHTML may try to check for well-formedness. innerRawSource just takes
your word for it and doesn't attempt to build an object tree.)
This is meant to ensure the easy way is the correct way. Explicit
encoding or decoding is rarely necessary.
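To illustrate the "encode by default" idea without the DOM, here is a string-based stand-in. It is not how TemplatedDocument is implemented; htmlEncode and fillTemplate are hypothetical helpers that only mimic the {$name} placeholder behavior described above.

```d
import std.array : replace;
import std.string : indexOf;

/// Escapes the characters that are special in HTML text and attributes.
string htmlEncode(string s)
{
    return s.replace("&", "&amp;")
            .replace("<", "&lt;")
            .replace(">", "&gt;")
            .replace("\"", "&quot;");
}

/// Replaces each {$name} with the encoded value from vars in one pass,
/// so inserted values are never scanned for placeholders again.
/// Unknown placeholders are left in the output untouched.
string fillTemplate(string html, string[string] vars)
{
    string result;
    for (;;)
    {
        auto start = html.indexOf("{$");
        auto end = start < 0 ? -1 : html.indexOf('}', start);
        if (end < 0)
            return result ~ html;

        auto name = html[start + 2 .. end];
        result ~= html[0 .. start];
        if (auto value = name in vars)
            result ~= htmlEncode(*value);
        else
            result ~= html[start .. end + 1];
        html = html[end + 1 .. $];
    }
}

unittest
{
    auto page = fillTemplate("<h1>{$title}</h1>",
                             ["title": "My cool site & stuff"]);
    assert(page == "<h1>My cool site &amp; stuff</h1>");
}
```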
My template class offers no way to loop. All it does is that
placeholder variable thing. To loop, do it yourself. One of my
pages defines a custom html tag:
<repeat times="10"> hello! </repeat>
Then you can implement it in the code like so:
foreach(e; document.getElementsByTagName("repeat")) {
    string html;
    foreach(i; 0 .. to!int(e.times)) // attributes are accessible like in javascript
        html ~= e.innerHTML; // get the inner contents
    e.outerHTML = html; // replace this tag with those contents
}
(For fancier modifications, there's also a getElementsBySelector,
letting you do CSS style loops.)
However, I'm more likely to just build those portions with the
DOM, with the template saying where it goes:
<div id="messages-holder"></div>
auto holder = document.getElementById("messages-holder");
foreach(message; messages) {
    // a shortcut method to create a child, set its text, and append it all in one
    holder.addChild("p", message);
}
Some people believe this is no better than putting html output
in your code as strings; it basically does the same thing. But
I don't agree this is a big problem:
a) You can still keep it separate with functions. See nntp.d for
an example of this in practice. getMessage() returns a Post object.
getMessagePage() takes a Post object and returns a Document.
(Note this is handled automatically by the FancyMain mixin, defined
in web.d, used in nntp.d. I've been mostly describing the lower level
classes in this post: cgi.d and dom.d. web.d builds upon them to
automate a lot of common tasks and to try to force more MVC
separation.
For an example of why this is cool, check this out:
http://arsdnet.net/d-web-site/nntp/get-message?newsgroup=digitalmars.D&messageId=<ii4993%241l5b%241%40digitalmars.com>&format=json
See the "&format=json" at the end? It outputs that message object
as json instead of the HTML page! There's no code in nntp.d
to do this - it's handled automatically by web.d. There's a
variety of formats available. Try table, xml, string, and html too. They
don't all work as well here because Post is a class rather than
a struct (web.d currently works much better with simple structs than
with fancy structs or classes), but in the future or in other
projects, they would work too.)
Back to templates in general, I also think HTML itself really
isn't layout nor style. Layout is done by the skeleton html files,
not the in-code DOM, and style is done with CSS.
For example, nntp.d doesn't do inline styles. It just describes
the data with tags and attributes, letting CSS finish the job of
colors and other such details.
My large work project currently has 6 skins for it, all written
independently of the code. None of the D had to be changed, despite
me using the dom to build out content loops.
The dom extensions also save huge amounts of time. Take this:
auto form = cast(Form) document.getElementById("my-form");
foreach(k, v; cgi.post)
form.setValue(k, v);
That takes all the POST variables and sets them in the given form.
Whether the form is built out of inputs, radio boxes, selects,
textareas - it doesn't matter. The Form class abstracts it all
away to a uniform interface.
(If it doesn't find a matching field in the given HTML, it
automatically appends an <input type="hidden"> with the given
values.)
Having to write:
<input type="text" name="something" value="<?= $something ?>" />
Oops, I didn't htmlEntitiesEncode that something, XSS time!
<select name="Something">
<option value="a" <?php if($a == "a") echo "selected=\"selected\"";?>>My Option</option>
<option value="b" <?php if($a == "b") echo "selected=\"selected\"";?>>My Option</option>
</select>
Sucks ass compared to form.setValue("Something", "a");
That alone infinitely outweighs any counter argument I've heard to
my use of a customized DOM.
I've been typing this for a long time, I'm going to break up comments
on the rest of your post into a separate message.

Stephan Soller wrote:
> Cache invalidation
> How do you handle this right now?
I don't. My program assumes that once it has a message, it never
needs to look to the server for it again.
(This is probably because of my own experience with mailing lists -
I use the mailing list interface to the newsgroup for reading. With
them, once the email is sent, it isn't going to change. I just assumed
the newsgroup worked the same way...)
> D website
> I really like the layout.
The credit for that goes to Christopher Bergqvist. See the thread
"Suggestion: New D front page" in the main newsgroup. He posted
a png outlining his idea and I just ran with it :)
> The compile and run button is a bit of a security risk. I was able
> to read the /etc/passwd file for example.
Yeah, but that's normal on a multi user linux system. It doesn't
really break anything.
But, I moved the compile and run program to a separate VM to
further limit it. If you read that entire filesystem, it doesn't
really matter - it's an out of the box Slackware install. There's
nothing sensitive or private on it at all.
(Like its domain name says, it is completely expendable info!)
> Denial of service attacks (e.g.
> endless loops) might still be a problem though.
I think this is solved with my use of setrlimit. If a process
eats more than 5 seconds of CPU time, the operating system kills it.
The limits are also set to 16 MB of RAM, 16 KB files, 3 forks,
and a bunch of other things.
(This might be interesting to test some programs - it will actually
get out of memory exceptions pretty easily!)
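For reference, setting such limits from D could look roughly like the sketch below. It assumes a POSIX system and uses the druntime bindings in core.sys.posix.sys.resource. The helper names are made up, RLIMIT_AS is only an assumption about how the 16 MB RAM cap is expressed, and the fork limit is left out because RLIMIT_NPROC is not part of POSIX. The limits would normally be applied in the child process right before it executes the untrusted binary.

```d
import core.sys.posix.sys.resource;

/// Sets both the soft and the hard limit of one resource.
/// Returns false if the operating system rejects the request.
bool capResource(int resource, rlim_t value)
{
    rlimit limit;
    limit.rlim_cur = value;
    limit.rlim_max = value;
    return setrlimit(resource, &limit) == 0;
}

/// Applies caps similar to the ones listed above to the current process.
/// Hypothetical helper: call it only in the process that is about to
/// run untrusted code, since the hard limits cannot be raised again.
bool applySandboxLimits()
{
    return capResource(RLIMIT_CPU, 5)                  // seconds of CPU time
        && capResource(RLIMIT_AS, 16 * 1024 * 1024)    // address space in bytes
        && capResource(RLIMIT_FSIZE, 16 * 1024);       // largest file in bytes
}

unittest
{
    // Disabling core dumps is harmless to try in any process.
    assert(capResource(RLIMIT_CORE, 0));
    rlimit current;
    assert(getrlimit(RLIMIT_CORE, &current) == 0);
    assert(current.rlim_cur == 0);
}
```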
Write access is also limited to a single directory, in addition
to that individual size limit. Filling up the disk shouldn't
be possible.
The operating system firewall prevents most network activity, incoming
and outgoing. You can play with sockets, but only if they are working
with localhost, and even then, they aren't allowed to access the
ssh port.
Running a spam bot off it is impossible.
More than this, the VM is also limited. I set its memory and CPU
limits to about 1/5 the resources of the physical server. So if
you did manage to get root and max out your program, it won't
have a significant impact on the other things running with it (all
low traffic websites). An external firewall serves as layer 2 to
protect against spambots.
Finally, I did a VM snapshot after setting it up. I'm considering
running a scheduled script on my computer to blank and reset that
VM every night. Then, if you got root and worked around my other
restrictions, it'd be a temporary victory anyway, just until I
revert the snapshot again.
All in all, I think I have a pretty safe setup. If I'm proven
wrong, plan B is to use the ideone API instead.
> If you only display mails in the announcements which do not have a
> "References" header you will only get mails that started a new topic.
> This will filter out replies.
Yes, that's what I wanted. The idea is to show a feed of new things
coming out, rather than new replies on old ideas. This way, the
homepage shows the most variety.
> Happy programming
Thanks! If I have any questions, I'll be sure to ask. I've gotta
get back to my real work soon though (stupid Monday) so finishing
this will probably have to wait until next weekend.

Adam Ruppe Wrote:
> foobar wrote:
> > 1. common human markup such as: _foo_ (underline), *foo* (bold) etc,
>
> Yeah, that's a pretty good idea. I agree with the others that it
> should keep the text symbols (especially since I've seen these
> algorithms wrongly flag things *a lot*) but a basic implementation
> is ok.
>
> > 2. parse BBCode.
>
> This probably isn't a good idea... unless it is a web input only
> filter.
>
> So posts pulled off the news server are treated as plain text - no
> BBCode parsing is attempted. But posts made through the website
> may be parsed, and converted to plain text before being forwarded
> to the news server. (Note that I use my beloved mutt mail client
> for reading the newsgroups myself, so anything that would break
> plain text email browsing is a no.)
>
> I already have pretty decent bbcode -> html and html -> text
> functions in my bag of toys, so regular participants never need
> to know what kind of input was used.
>
> It would let web users feel more at home without impacting
> everyone else.
>
>
> The only downside I see is if people think bbcode is accepted,
> someone might write it in their newsreader or email client, where
> it won't be parsed. I don't want the groups to get filled up
> with bizarre markup everywhere, but, the kind of users who use
> email clients and newsreaders probably won't make that mistake
> anyway.
>
>
> So yeah, let's give it a try for web posting and see if it works out.
Just to clarify, I don't want text posts to be filled with lots of markup either.
BBCode was just an example of a lightweight markup which is familiar to web based forum users. Other options could be Markdown and reStructuredText. Basically whatever is lightweight enough to not bother text mode users and is also useful enough when parsed by your web reader to convert code into those awesome "compile & run" boxes.
We could also support just a tiny subset of BBCode (just the [code] tag), so that code snippets would be identified without a fuzzy guessing algorithm.
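A rough sketch of what recognizing only the [code] tag could look like. The function name extractCodeBlocks is made up, the tags are assumed to be lowercase and not nested, and an unclosed tag is simply ignored.

```d
import std.string : indexOf;

/// Returns the contents of every [code]...[/code] pair in a post,
/// in order of appearance.
string[] extractCodeBlocks(string post)
{
    enum open = "[code]";
    enum close = "[/code]";
    string[] blocks;
    for (;;)
    {
        auto start = post.indexOf(open);
        if (start < 0)
            break;
        auto rest = post[start + open.length .. $];
        auto end = rest.indexOf(close);
        if (end < 0)
            break; // unclosed tag: treat the remainder as plain text
        blocks ~= rest[0 .. end];
        post = rest[end + close.length .. $];
    }
    return blocks;
}

unittest
{
    auto blocks = extractCodeBlocks("Try [code]int x = 1;[/code] first.");
    assert(blocks == ["int x = 1;"]);
}
```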

Adam Ruppe wrote:
> In the other newsgroup, I've been talking about a little
> web news program I've been writing as a spinoff of the
> potential new homepage idea.
That is great news. I've been wanting to do one for years! I haven't looked much
at yours yet, but here's my ideas anyway :-)
1. Can use web interface or nntp interface
2. web interface looks sort of like reddit, i.e. all posts on a thread
3. users can post anonymously
4. web interface supports logins - logged in users can vote up or down on posts
5. web interface can mark posts as read or unread - fixing my beef with reddit
that there's no reasonable way to scan a thread for new posts
6. an easy way for moderators to delete spam
7. runs on 64 bit FreeBSD (what the Digital Mars server runs on), yes, I know
that means I have to get 64 bit dmd on FreeBSD working!
I can contribute the code that generates the D archive pages from the news postings.

8. Search functionality
digitalmars uses google for searching the NG archive, but I've no idea
how to do custom searches. I.e. I'd like to search for a keyword in
the topic title only, how would I do that?