Bumping unwanted translations

To bump a translation is to move an open database record to the uploader's local database. Thus, anonymous translations become private, and public translations become personal. To unbump is to undo that move.

Bumping is desirable when an unwanted anonymous or public translation is found. By bumping, the uploader is still able to access his own translation but other users will not see it. (Unwanted private translations can be handled via moderation and sanctions; unwanted personal translations can be edited or deleted.)

Alice sets up a visitlink at /newmarks/A/Hello,-world
Carol uploads a translation to it, which lives at /xlates/C/Hello,world/en/public/31

In this arrangement:

Alice may bump and unbump Carol's translation

Alice's SuperOrdinate may bump and unbump likewise

If Alice's SuperOrdinate assigned another subordinate bump moderator privileges, that user may also bump and unbump.

Carol may bump her own translation, but may not unbump it: she would have to delete and recreate it.

Carol's SuperOrdinate can likewise bump, but not unbump

If Carol's SuperOrdinate assigned another subordinate bump moderator privileges, that user may also bump but again not unbump.

All other users may neither bump nor unbump.

Carol (and so her SuperOrdinate and bump moderator) may not unbump in order to prevent her from undoing Alice's bump. The visit link belongs to Alice, so Alice gets to say what stays and what goes.

That said, Carol may have legitimate reasons for her translation, which is why Alice may not edit or delete it.
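The permission rules above can be condensed into a small lookup. This is only an illustrative sketch: the role names and the functions `may_bump` and `may_unbump` are hypothetical and are not part of the csi18n API.

```python
# Hypothetical role names for the actors described above; not csi18n identifiers.
LINK_SIDE = {
    "link_owner",                  # Alice
    "link_owner_superordinate",    # Alice's SuperOrdinate
    "link_owner_bump_moderator",   # moderator assigned by Alice's SuperOrdinate
}
UPLOADER_SIDE = {
    "uploader",                    # Carol
    "uploader_superordinate",      # Carol's SuperOrdinate
    "uploader_bump_moderator",     # moderator assigned by Carol's SuperOrdinate
}


def may_bump(role: str) -> bool:
    """Both the visit link side and the uploader side may bump."""
    return role in LINK_SIDE or role in UPLOADER_SIDE


def may_unbump(role: str) -> bool:
    """Only the visit link side may undo a bump."""
    return role in LINK_SIDE


for role in ("link_owner", "uploader", "someone_else"):
    print(role, may_bump(role), may_unbump(role))
```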

CORS - Cross Origin Resource Sharing

A browser can be told by website A to obtain resource B from server C. This is such a security hole that modern browsers will protect their user by refusing to undertake such a request unless server C specifically says it's prepared to do it. CORS is the protocol by which servers indicate that preparedness.

The csi18n service supports CORS. If you want to serve webpages that have scripts that directly connect to the csi18n server, you'll need to keep reading. If your scripts connect back to your same website for cached copies, you can likely skip this section.

How CORS works

Website A above is known as the origin. When a browser detects that a script is about to connect to a website other than its origin, that request is sent with an additional header, Origin: , with the origin's address as a value.

Server C, if it sees Origin: , looks to see if the website listed is permitted to access the given resource. If it is, that permission is shown in an additional response header, Access-Control-Allow-Origin: , again with the origin as a value.

If server C permits Origins to access that resource, just not the one requested, that response header is not added. If server C doesn't understand CORS, that response header is also not added.

If the response header Access-Control-Allow-Origin: is not present, the browser knows that server C cannot or will not authorise the exchange, and so prevents the script from obtaining the response.
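The exchange described above can be sketched from both sides. This is a simplified model of the simple-request case only, not the csi18n implementation; the function names and the `permitted_origins` set are assumptions made for illustration.

```python
def cors_response_headers(request_headers: dict, permitted_origins: set) -> dict:
    """Return the extra response headers server C would add, if any."""
    origin = request_headers.get("Origin")
    if origin is not None and origin in permitted_origins:
        # Permission is signalled by echoing the origin back.
        return {"Access-Control-Allow-Origin": origin}
    # No Origin header, or the origin is not permitted: add nothing.
    return {}


def browser_releases_response(origin: str, response_headers: dict) -> bool:
    """The browser hands the response to the script only if its origin is echoed."""
    return response_headers.get("Access-Control-Allow-Origin") == origin


permitted = {"http://example.com"}
reply = cors_response_headers({"Origin": "http://example.com"}, permitted)
print(browser_releases_response("http://example.com", reply))
```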

What actually happens is much more complex than this, particularly with the PUT and DELETE methods, so you are encouraged to read the W3C recommendation.

Using CORS with csi18n

You must make the csi18n server aware of which origins are acceptable before your scripts will be authorised for cross-origin requests.

CORS currently permissions access to all resources across the system, except for the root /

Origins are permissioned only for visits to your visit and translation links: other people cannot permission to your links, nor can you permission to theirs.

403 Forbidden is currently returned for any request that fails a CORS check, whether or not it would have succeeded otherwise. However, the CORS specification requires that you use the presence or absence of Access-Control-Allow-Origin: with your domain to determine whether a CORS request has succeeded or failed.

Case-sensitivity: origins are matched case-sensitively, so http://example.com would not match http://Example.com

Note that the Origin https://rest.mpsvr.com is always acceptable.

Anonymous access

Translations that have been uploaded with 'anonymous' visibility are available without need for username or password. You should still use X-APIKey:, and you should still permission your domain at /subscribers/me/CORS.

/subscribers/me/CORS

To access the service from browsers connecting to your own website you must create a cors-origin record at the csi18n server. A cors-origin record tracks the subscriber ID used in your visit and translation links, and the domain from which the request is expected. You will need to add each domain that might generate a match, including * and null.

Thus, if you are hosting a script at http://news.blogs.cnn.com/2007/01/20/hello-world, and it tries to access your translation at https://rest.mpsvr.com/7/Example-Hello-World, the cors-origin record will need the domain http://news.blogs.cnn.com and the visitsid 7.

Wildcards

Only the "*" wildcard will match all domains presented. In the example above, the domain http://*.blogs.cnn.com would not match news.blogs.cnn.com. Rather, it is treated literally: if the user-agent sends http://*.blogs.cnn.com, then it would match. Note that browsers ought never to do this, as http://*.blogs.cnn.com is not a valid Fully Qualified Domain Name as per RFC952.

The "null" domain is also supported. If a browser cannot tell the origin of the request - for instance, if the request stems from a webpage on a local disk (common for developers) - then it will send Origin: null. If you want to make use of null domains, you will need to upload it like any other.
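The matching rules described so far (case-sensitive comparison, the lone "*" wildcard, literal treatment of everything else including "null", and the always-acceptable https://rest.mpsvr.com) can be sketched as below. The function name and the `recorded_domains` set are hypothetical, and the sketch assumes that a "*" record matches every Origin value presented.

```python
ALWAYS_ACCEPTED = "https://rest.mpsvr.com"


def origin_matches(origin: str, recorded_domains: set) -> bool:
    """Decide whether an Origin value matches a set of cors-origin domains."""
    if origin == ALWAYS_ACCEPTED:
        return True
    if "*" in recorded_domains:
        # Assumption: the lone "*" record matches whatever is presented.
        return True
    # Everything else is compared literally and case-sensitively,
    # so "http://*.blogs.cnn.com" only matches that exact string.
    return origin in recorded_domains


records = {"http://news.blogs.cnn.com", "http://*.blogs.cnn.com", "null"}
print(origin_matches("http://news.blogs.cnn.com", records))
print(origin_matches("http://money.blogs.cnn.com", records))
```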

Storyboard 04 walks through the creation and use of cors-origin records.

Preflight checks

A preflight check is successful if the URI matches a resource that could exist. As such, a successful preflight check cannot tell you anything about whether the resource does exist.

Preflight checks - as with all CORS requests - must be made with an Origin: header

The OPTIONS method must also be used; use of any other method would start an actual request, not a preflight check

Preflight checks may be made without Authorization: and X-APIKey: headers

Failing preflight checks earn a 403 Forbidden

Successful preflight checks earn a 200 OK

Do not use these 403 or 200 HTTP responses to determine success: the standard requires you use the presence or absence of Access-Control-Allow-Origin: along with the expected origin address as its value.
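A minimal sketch of a preflight check follows, assuming Python's standard urllib. The request is only built, not sent. Access-Control-Request-Method is the standard CORS preflight header rather than something named in this document, and the helper names are hypothetical.

```python
import urllib.request


def build_preflight(url: str, origin: str, method: str) -> urllib.request.Request:
    """Build (but do not send) an OPTIONS preflight request."""
    return urllib.request.Request(
        url,
        method="OPTIONS",
        headers={
            "Origin": origin,
            # Standard CORS preflight header naming the intended method.
            "Access-Control-Request-Method": method,
        },
    )


def preflight_succeeded(response_headers: dict, origin: str) -> bool:
    """Judge success by the echoed origin, not by the 200 or 403 status."""
    return response_headers.get("Access-Control-Allow-Origin") == origin


request = build_preflight(
    "https://rest.mpsvr.com/xlates/1/A-winner-is-you/en-CA/personal/336",
    "http://example.com",
    "PUT",
)
print(request.get_method(), request.get_header("Origin"))
```

Real response headers are case-insensitive, so a production client would normalise header names before the lookup.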

POSTing duplicates and CORS

In the normal course of HTTP, if you POST the same representation twice, you will only create one resource on the csi18n server. The first POST gets a 201 Created response and the second would get a 301 Moved Permanently response. The second request is known to have a match already, and so that match's URL is returned in a Location header.

CORS disallows using 3xx codes this way.

If you submit two POSTs using CORS, the server will notice this and return 201 Created each time. On the second and subsequent occasions, the server also returns a new header:

X-CORS-201-not-301: Status code would have been 301 if not a CORS request

If your script needs to know the difference between a brand new URL and a reissue of an existing URL, please also scan for that header.
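A client could tell the two cases apart along these lines. This is a sketch only: the function name and the returned labels are hypothetical, and header names are assumed to arrive exactly as written above.

```python
def post_outcome(status: int, headers: dict) -> str:
    """Classify the result of a POST, with or without CORS."""
    if status == 201 and "X-CORS-201-not-301" in headers:
        return "existing"  # CORS request: would have been 301 otherwise
    if status == 201:
        return "created"   # brand new URL
    if status == 301:
        return "existing"  # non-CORS duplicate: URL is in Location
    return "other"


print(post_outcome(201, {}))
print(post_outcome(201, {"X-CORS-201-not-301": "Status code would have been 301 if not a CORS request"}))
```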

CSRF/XSRF

Cross-Site Request Forgeries are where User A's browser generates a request to URL B at Server C through the browser's visit to Server D. If User A is "logged on" at Server C, URL B can have any effect chosen by the attacker.

After research, it seems that CSRF-specific defences may not be required at the csi18n service: simple as I am, I've been unable to craft a CSRF attack that didn't rely on other, more deadly attacks.

For attacks targeting this service, my understanding of the situation is this: CSRF attacks require a browser to make the request, which it can do passively (html, css) or actively (javascript), making either idempotent or non-idempotent requests.

Idempotent requests. If the browser creates a GET request, or javascript creates GET, HEAD or OPTIONS requests, idempotence means nothing will change on the server. What data is returned to the browser may or may not make sense, but either way cannot get returned to the attacker - who thus won't know what happened. If it does create a connection back to the attacker, that would seem to be an XSS vulnerability in the rendering.

Non-idempotent requests. If the browser creates a POST request, or javascript creates DELETE, POST or PUT requests, non-idempotence means something may change on the server. However, the requests go through a browser, and as such are subject to CORS origin checks. If they don't pass, the request is not acted on. An attacker would therefore need write access to a permissioned Origin.

If an attacker has full write access to a permissioned Origin, that Origin has a bigger problem than CSRF it needs to fix.

If an attacker has XSS access on a permissioned Origin, that Origin still has a bigger problem than CSRF it needs to fix.

For attacks using this service to relay CSRFs back to a browser, this seems to be a bit easier: if the CSRF creates a clickable link, active javascript etc, then that seems to be either intentional, or an XSS vulnerability in the receiving service: the exact same representation could contain an unsafe CSRF and ought to be blocked, or be safely describing a CSRF and ought not to be blocked.

For clarity, I'm not asserting that the service can't be subverted: it does issue html, and that html can contain crafted translations that could cause CSRF issues where not sanitised, and which, if found, should be quickly fixed. What I'm saying is that until I understand the threat more clearly, the csi18n service does not add CSRF-specific counter-measures such as synchroniser tokens: idempotence, CORS and RESTful architecture seem to design out the simplest CSRF attacks. If you know better, and have the code to prove it, I'd love to see it.

If I did implement CSRF-specific counter-measures, I'd like to consider this outrageously elegant solution.

Dates/Times in CDMI format

Developers are encouraged to consider CDMI-Dates in preference to HTTP-Dates where intermediate caching is not important.

CDMI-Dates include microseconds, as does the csi18n backend. As HTTP-Dates don't, use of these can lead to problems with distinguishing resources created within the same second.

Rationale

While use of HTTP-Date for microsecond fields would kinda sorta work, the service could not detect the difference between two resources uploaded within the same second. So, date handling was added based on the CDMI v1.0.2 document. Note that RFC-2616 suggests you use ETags for resolving these kinds of issues.

A CDMI-Date in this context has one format: "YYYY-MM-DDTHH:MM:SS.mmmmmmZ". Everywhere that HTTP-Date is used, CDMI-Date is used too. So, the header Date: has a matching X-CDMI-Date: field; where If-Modified-Since: can be used, X-CDMI-If-Modified-Since: is also available. If a client uses both in a request, the CDMI-Date will be used and the HTTP-Date silently discarded. If an HTTP-Date is used on its own, it would be converted internally to a CDMI-Date, with the microseconds set to ".000000" before use.
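As a rough illustration, CDMI-Dates can be produced and parsed with Python's standard library. The sketch assumes the microseconds are separated from the seconds by a period, as the ".000000" above suggests; the helper names are hypothetical.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

# Assumed layout: a period before the six microsecond digits, then a literal Z.
CDMI_FORMAT = "%Y-%m-%dT%H:%M:%S.%fZ"


def to_cdmi(moment: datetime) -> str:
    """Format a timezone-aware datetime as a CDMI-Date in UTC."""
    return moment.astimezone(timezone.utc).strftime(CDMI_FORMAT)


def from_cdmi(text: str) -> datetime:
    """Parse a CDMI-Date into a timezone-aware UTC datetime."""
    return datetime.strptime(text, CDMI_FORMAT).replace(tzinfo=timezone.utc)


def http_date_to_cdmi(http_date: str) -> str:
    """Convert an HTTP-Date; the microseconds come out as .000000."""
    return to_cdmi(parsedate_to_datetime(http_date))


print(http_date_to_cdmi("Sun, 06 Nov 1994 08:49:37 GMT"))
```

Two CDMI-Dates that differ only in their microseconds collapse to the same HTTP-Date, which is the ambiguity the X-CDMI-* headers are there to avoid.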

Dates/time in the linked list

Notice the server operates as an NTP client so it's possible that earlier elements in the linked-list could have a later date/time than following elements.

Deleting doesn't delete

Successive changes to a resource do not modify the existing resource at the file-system or database record level, but supersede that record with a new record that includes the change.

Rationale

This policy is in place to provide a method of handling data poisoning whereby a malign actor could create, update or delete resources in the name of another user. Once aware of the compromise, the various records can be rolled back to a given point in time, even if deleted. Currently, rolling back in this way would be handled manually and in the first instance you should contact us.

Implementation

csi18n resource changes are implemented through time as linked-lists, not as isolated file-system or database records. Each record in the list is marked as superseding the last, and the record which is not superseded is taken as the "live record". An update to the resource sees a new record added to the linked-list, superseding the last, and which itself becomes the new live record. When asking for the resource, it's this live record which is returned.

When method DELETE is used on a resource, the new live record itself gets a "deleted" marker set. In this situation, the whole linked-list is silently ignored. That is, asking for a resource will see a 404 Not Found error returned, just as if asking for a record which had never existed.

In some situations, a resource which has been deleted can nevertheless be accessed. In particular, the account must have appropriate privileges (creator, moderator etc), and must also have a direct link to the record within the linked-list. Direct links are of the form /xlates/{user}/{newmark}/{lang}/{visibility}/{continuity record id}/{database record id}. The last two identifiers usually, but not always, match for the first version of the record: https://rest.mpsvr.com/xlates/1/A-winner-is-you/en-CA/personal/336/336

Example: in a linked-list of three successive updates the last of which raised the deleted flag, a request for the live record at https://rest.mpsvr.com/xlates/1/A-winner-is-you/en-CA/personal/336 will return 404. However a request for the third item in the linked list https://rest.mpsvr.com/xlates/1/A-winner-is-you/en-CA/personal/336/336 will see the record returned even though the deleted marker was set.

Assuming you have the relevant rights, retrieving the live record will also retrieve hyperlinks to the linked-list records, and retrieving any one of those will also obtain hyperlinks to the first, last, next and previous records in the linked list.
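The behaviour described in this section can be modelled in a few lines. This is a toy model, not the csi18n schema: the class and field names are hypothetical, a Python list stands in for the linked-list, and privilege checks are left out.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Record:
    record_id: int
    body: str
    deleted: bool = False


class Resource:
    """A resource as a chain of records, each superseding the one before."""

    def __init__(self, first: Record) -> None:
        self.chain: List[Record] = [first]

    def update(self, record_id: int, body: str) -> None:
        # Nothing is modified in place: a new record supersedes the last.
        self.chain.append(Record(record_id, body))

    def delete(self, record_id: int) -> None:
        # DELETE adds a new live record carrying the deleted marker.
        self.chain.append(Record(record_id, self.chain[-1].body, deleted=True))

    def live(self) -> Optional[Record]:
        # None stands for the 404 Not Found seen by ordinary requests.
        newest = self.chain[-1]
        return None if newest.deleted else newest

    def direct(self, record_id: int) -> Optional[Record]:
        # Direct links still reach individual records after deletion.
        return next((r for r in self.chain if r.record_id == record_id), None)


resource = Resource(Record(1, "first draft"))
resource.update(2, "second draft")
resource.delete(3)
print(resource.live(), resource.direct(3))
```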

You really, REALLY need a deleted resource, and its linked-list removed?

HTTP

Caching

If you use the X-CDMI-* headers for strong validation, please note intermediate caches will likely pass your request straight through to the origin server: they cache on HTTP dates.

Your options include using ETags instead, adapting your intermediate caches, and validating directly against the service.

At the time of writing caching hasn't actually been tested in-the-wild.

Headers

The service may supply Expires: and Cache-Control: headers.

If those headers are not supplied, then the service makes no claim to the cacheability of the supplied resource.

If the headers are supplied, and the resource should be cached, they'll arrive with an Expires: {HTTP-date} and Cache-Control: public header.

Resources that shouldn't be cached at all will arrive with no Expires: header and a Cache-Control: must-revalidate; private; s-maxage=0 header.

Chunking ambiguity

RFC2616sec4.4 asserts "All HTTP/1.1 applications that receive entities MUST accept ... "chunked" transfer-coding" having just said "the server SHOULD respond with ... 411 (length required) if it wishes to insist on receiving a valid Content-Length." This service observes the latter.

Chunking is particularly useful when the length of the body is not known before the headers are sent, for instance with streaming.

For the moment at least, the representations sent and received by the server are of such a form that the length would always be known. Thus, this service will reply with 411 Length Required if Transfer-Encoding: chunked is received without a valid Content-Length:.

Additionally, the service itself won't send back in chunked form (although I can't rule out Apache overriding it), so TE: chunked will be ignored, and the response made in identity.

The presence of a CDMI conditional causes the service to ignore any HTTP conditionals.

Content Negotiation

RFC2616sec12 includes "Any response containing an entity-body MAY be subject to negotiation, including error responses." While the service does support content negotiation, the negotiation of error responses is not something I've seen elsewhere.

For now, Content Negotiation of error responses is confined to two situations: see Accept-Language for more details.

Header fields

Accept

Your client will use this header to describe with what media type(s) it wishes to receive data, and the relative preferences for different kinds.

For example, if it sends Accept: application/json then the service will provide the representation in a JSON form.

To say it doesn't mind which structure, send Accept: application/json, text/plain and the service will send the answer in either form.

To describe relative preference, sending Accept: application/json;q=0.4, text/plain;q=0.8 will see the service try to answer as text/plain in preference.

The service supports wildcard types, subtypes and types/subtypes as per RFC2616.
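How a server might rank an Accept header can be sketched as follows. This is a simplification and not the service's actual algorithm: it orders media ranges by q value alone, ignores the RFC2616 rule that more specific ranges take precedence, and uses hypothetical function names.

```python
from typing import List, Optional, Tuple


def parse_accept(header: str) -> List[Tuple[str, float]]:
    """Return (media_range, q) pairs, highest q first."""
    ranges = []
    for part in header.split(","):
        pieces = [piece.strip() for piece in part.split(";")]
        if not pieces[0]:
            continue
        q = 1.0  # q defaults to 1 when not given
        for param in pieces[1:]:
            name, _, value = param.partition("=")
            if name.strip() == "q":
                q = float(value)
        ranges.append((pieces[0], q))
    # sorted() is stable, so equal q values keep their header order.
    return sorted(ranges, key=lambda pair: pair[1], reverse=True)


def range_matches(media_range: str, media_type: str) -> bool:
    """Match type/subtype, honouring * in either position of the range."""
    r_type, _, r_sub = media_range.partition("/")
    m_type, _, m_sub = media_type.partition("/")
    return r_type in ("*", m_type) and r_sub in ("*", m_sub)


def choose(header: str, supported: List[str]) -> Optional[str]:
    """Pick the first supported type matching the most preferred range."""
    for media_range, q in parse_accept(header):
        if q <= 0:
            continue  # q=0 means "not acceptable"
        for media_type in supported:
            if range_matches(media_range, media_type):
                return media_type
    return None


print(choose("application/json;q=0.4, text/plain;q=0.8",
             ["application/json", "text/plain"]))
```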

The service supports the following structures: application/xml, text/xml, application/json, application/vnd.php.serialized, text/html, text/plain. In some situations it will also support prs.mpsvr.getable and application/x-www-form-urlencoded.

On receiving the reply, examine the Content-Type: header for the media type actually used. Also note that header may contain version information, see Content-Type for details.

Accept-Charset

The service will only reply using UTF-8.

Notice that this doesn't conform to RFC2616sec14.16 which implies ISO-8859-1 must be supported. However it does conform with RFC7231sec5.3.3 which supersedes it.

If this header is absent from your request, the service will use UTF-8. If present, it must accept UTF-8 or * otherwise the service will respond 406 Not Acceptable.

Accept-Encoding

The service will only reply using identity. ("identity" asserts that no additional coding is to be used.)

If this header is absent from your request, the service will use identity. If present, it must accept identity or * otherwise the service will respond 406 Not Acceptable.
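The same acceptance rule applies to both Accept-Charset and Accept-Encoding, and can be sketched as a single check. The function name is hypothetical, and the sketch ignores q values (including q=0), which a full implementation would need to honour.

```python
from typing import Optional


def accepts(header: Optional[str], required: str) -> bool:
    """True if the request can be served; False would mean 406 Not Acceptable."""
    if header is None:
        return True  # header absent: the service default is used
    offered = {item.split(";")[0].strip().lower() for item in header.split(",")}
    return required.lower() in offered or "*" in offered


print(accepts("ISO-8859-1, UTF-8;q=0.7", "utf-8"))   # Accept-Charset
print(accepts("gzip, deflate", "identity"))           # Accept-Encoding
```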

Accept-Language

Sending this header in your requests asks the service to change the language of the whole response, including headers, not just the language of the content which is returned. This is almost certainly not what you want, as this would be of use only to programmers accessing the service who don't speak English. Also, Accept-Language wouldn't usually affect the language of the content anyway.

In any case, this is usable in only two places:

The index page. The index page has content drawn direct from the service, so different content will be returned depending on this header. This is in place only to demonstrate that user-facing content could be dynamically created on the server.

Authentication problems. Some of the HTTP errors relating to authentication problems are also drawn direct from the service, so different text (but not different response codes) may be seen depending on this header. This is in place only to demonstrate that programmers without English reading skills could be supported.

In the above instances, the language drawn depends on the q= provided against each language given. If * is encountered, the first available language - if any - is used.

In all other instances, this header is ignored and any response made in English even if the content of the response is in another language, and even if English is specifically excluded by the header.

Accept-Ranges

You may use Range: for download, but you will not be able to use ranges for upload.

The csi18n service is a backend to Apache webserver. If you download a range, a full entity is provided to Apache which will itself rework the response to provide the requested range. However, when you upload a range, Apache indicates that to the service, which will issue a 416 Requested range not satisfiable response.

When making a ranged retrieval, you'll notice that the HTTP response 206 Partial content will diverge from the service response X-Testing-Dupe: 200 OK. In all cases, rely on the HTTP response.

Content-Language

This header is only sent from the server in response to a request on the index page, or following some authentication failures. When filled in, it will contain a subset of those languages in your Accept-Language request header that were used in completing the response.

When this header is received by the server, it will cause a 501 Not Implemented response. That's because the header refers to the natural language of the entity body, and the entity body is usually a data structure: it has no natural language.

Content-MD5

Content-MD5: asserts the MD5 hash of the message body starting from the first byte following \r\n\r\n at the end of the headers, up until and including the last byte sent.

Sending it to the server is not mandatory, however if it is sent the MD5 will be checked and if it doesn't match, a 400 Bad Request response issued.

The server will always send a Content-MD5: when it sends a body.
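A client can compute and check the header value along these lines. The sketch assumes the value is the base64-encoded MD5 digest, which is how RFC2616 defines Content-MD5; this document does not state the encoding the service expects, and the function names are hypothetical.

```python
import base64
import hashlib


def content_md5(body: bytes) -> str:
    """Base64 of the binary MD5 digest of the message body."""
    return base64.b64encode(hashlib.md5(body).digest()).decode("ascii")


def body_matches(body: bytes, header_value: str) -> bool:
    """Check a received body against the Content-MD5 header it arrived with."""
    return content_md5(body) == header_value.strip()


payload = b'{"translation": "Hello, world"}'
print(content_md5(payload))
```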

Content-Range

Do not use.

The service will not emit this header, although Apache webserver might - see Accept-Ranges for details.

The service does not handle range uploads at this time, so if this header appears on your request it will cause a 416 Requested range not satisfiable response.

Content-Type

When the server supplies a Content-Type header, the media type described by it is usually decided based on the Accept header your client sent with the request.

Version

A server-supplied Content-Type header may also include a parameter with version information which your client should reuse.

Take changing a spelling error using JSON as an example. Your client might do this as a GET of the original, then a PUT of the changes.

Notice that ";v=1.0" in the initial response is reused in the PUT. This version information is used by the service to determine, among other things, what fields are expected in the representation. If that version information is absent, then the request may draw a 400 Bad Request response.

Also note that strict comparison is used, so a version of "1.0" is not the same as "1.00" or "1".
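One way for a client to carry the version across is to pull the v parameter out of the Content-Type it received and send the same header back unchanged. The helper below is a hypothetical sketch; it assumes the parameter is named v, as in ";v=1.0".

```python
from typing import Optional


def content_type_version(content_type: str) -> Optional[str]:
    """Extract the v parameter from a Content-Type value, if present."""
    for param in content_type.split(";")[1:]:
        name, _, value = param.partition("=")
        if name.strip() == "v":
            return value.strip().strip('"')
    return None


received = "application/json;v=1.0"
# Reuse the header verbatim for the PUT so the version string is untouched.
put_headers = {"Content-Type": received}
print(content_type_version(received))

# Strict comparison: these are three different versions to the service.
print("1.0" == "1.00", "1.0" == "1")
```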

text/html replies do not normally carry version data this way. Rather, it's included as a hidden entry in any form supplied.
For example the password recovery page at https://rest.mpsvr.com/subscribers/recover includes

Date

RFC2616sec14.18 suggests that Date: (and so X-CDMI-Date:) represent the moment just before the entity is created, and that's what happens here. Note that Last-Modified: when following a newly created resource may be somewhat earlier than Date:. This is because the entity is the lower part of the response, not the representation it contains, nor the resource from which it was created.

Expect

Do not use.

The csi18n service makes no use of any Expect values. All values other than 100-continue will cause a 417 Expectation failed error as per RFC.

100-continue

Do not use.

This expectation provides for headers to be examined before potentially large uploads are sent.

At the moment, Apache will not pass control over to the csi18n service until all your data has been uploaded, including your entity-body. The service will only then get to look at your headers, and provide the appropriate response.

If you do use Expect: 100-continue, csi18n will emit an X-Warning-N header that the expectation has been ignored.

Supporting 100-continue may be implemented in the future.

Last-Modified

The service makes records with microsecond granularity so it's quite possible that one record may be updated twice in the same second. That makes Last-Modified a weak validator.

X-CDMI-Last-Modified is provided for strong validation based on date, but do note the warning in Caching above.

Last-Modified is not offered on resources at and underneath /newmarks/{user}/{newmark} as to do so would cause at least four extra database look-ups to discover the timestamp for the last change. This would be necessary to discover if deleting a recent record changes effective access to an earlier record. Use ETag instead.

Origin

This is a CORS header: when received, it signals to the server that a browser has been referred directly to the server by script served from another website as specified in the Origin: value field.

As this could be a security hole, the browser asks if the csi18n service is prepared to honour requests from that Origin website. If yes, CORS requires an Access-Control-Allow-Origin: to be returned. If not, absence of that header indicates to the browser that the request will not or cannot be honoured.

TE

Do not use.

Used when your client is telling the service in what encoding it would like the response.

The only acceptable values are chunked, deflate and gzip. Any other value will cause a 501 Not Implemented response. All three may also include quality data of the form ;q=X.Y.

At this time TE: causes an error if invalid, and is ignored if valid.

Transfer-Encoding

Used when your client is telling the service what encoding is used for an upload.

The only acceptable value is chunked. Any other value will cause a 501 Not Implemented response.

Trailer

If received, the service will silently ignore this header.

Vary

Conducting a GET against /csi18n/beta0.4/{user}/{newmark}/{lang} will see a Vary: Accept, Authorization header returned.

The Vary: header is used to indicate to caches that the representation returned may also change with the Accept: header (because a different media type will see a different representation of the underlying resource) and the Authorization: header (because a user B may have preferred a different translation to user A.)

X- headers

The X- headers are peculiar to the csi18n service.

X-Testing-Dupe

The HTTP response line cannot always be captured in unit tests, so this header (which can always be captured) is emitted at the same time and with the same content as that response line.

The HTTP response line from the service can itself be overridden by Apache webserver in some circumstances, such as when using Range: to download.

Newmarks can contain any characters, but those should be percent-encoded. For example, "Hello/World" should be encoded as "Hello%2FWorld".

The csi18n service is blind to percent-encoding, so if you need the newmark percent-decoded, you'll need to do that yourself.
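In Python, the standard library covers both directions. The helper names are hypothetical; `safe=""` is passed so that "/" is encoded too, since `quote` leaves it alone by default.

```python
from urllib.parse import quote, unquote


def encode_newmark(newmark: str) -> str:
    """Percent-encode a newmark for use as a single URL path segment."""
    return quote(newmark, safe="")


def decode_newmark(encoded: str) -> str:
    """Undo the encoding on the client side; the service will not do it for you."""
    return unquote(encoded)


print(encode_newmark("Hello/World"))
print(decode_newmark("Hello%2FWorld"))
```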

Keyholes

Superordinates may open a keyhole on their account. Closed by default, opening the keyhole allows the general public to access the "private" translations on your account. While others can read those translations, and can prefer one translation over another, the public is not allowed to create new translations: that privilege is reserved only to you and your subordinates.

To open the keyhole, do a GET to /subscribers/me/keyhole, and PUT back the resource supplied with false turned to true. To close the keyhole, repeat with true turned to false.

The keyhole may be closed by the Superordinate at any time. If this happens, outside users will receive 404 Not Found errors. If the keyhole is subsequently reopened, their requests (including preferences) will immediately start being honoured again.

Language codes, use of

This section applies to the language of the content transferred, not just the Accept-Language header dealt with above.

When asking or offering a translation of, say, "Hello, world", your client will send and the server will expect a language code. This code is used to determine whether the translation is an appropriate response to a later request. For example: a client might upload "Hello, world" using the ISO-639-1 code "en". Someone later asking for a translation to "en" would be presented that first translation on the basis that the second "en" matches the first.

It should be noted that while "en" matches "en", it is not matched by "EN", "En" or "eN": the last three are assumed to be codes for a different language. This is bound to cause problems, as I see a standard around both "en-US" and "en-us", which this service currently treats as different codes: ISO639-1 seems to require all lowercase, whereas ISO3166-1a2 - also used to determine these codes - seems to mandate some uppercase.

Quite aside from the uppercase/lowercase question, there are two more big problems: neither of those two standards always provides a language code, nor do they handle user capability. Example: For the former, there's no standard way of identifying "Glasgow Scots" as used by Clarks' translation of Carroll's "Alice’s Adventures in Wonderland" ("sco" is insufficient given it admits no distinction with another dialect, such as Shetland.) For the latter, someone who only speaks Japanese might recognise his language as 日本語 (Nihongo) rather than "ja" in what to him would be a foreign alphabet.

This service does not solve these problems. Rather, it's set up so that the language code is whatever the users choose it to be, accepting any Unicode character string as a language code. Swahili speakers who identify with "sw" will find others who agree, Swahili speakers who identify with "كِسوَهِل" will also find each other. Whether and how the twain shall meet is their call to make, possibly by asking for a translation in either code. While a GET/HEAD can operate on multiple codes using comma separation, the other verbs do not. Thus a translation in both "sw" and "كِسوَهِل" would need to be POSTed twice, once for each.

A further advantage to this approach is that it can accommodate any language grouping from a whole language such as English or Chinese, down through the dialects, even to idiolects and on to imaginary languages such as Elvish, or Klingon.

Locks, preferences & Semi-random responses

The sticky record keeps track of when a particular response is required from a particular Visit Link, and it comes in two forms.

The preference applies only to the user who uploaded it, and is the basis of determining which translations are broadly acceptable. The lock applies to all users and may only be uploaded by the Visit Link owner.

Where a sticky hasn't been recorded, the service randomly chooses among the translations that are available.

Where multiple stickies are in place (For example, if you preferred a particular English translation, likewise a Danish one, then later ask for something matching en, da) then which is returned is undetermined.

If the translation appears in a link, the hello world</a> gives some text, then terminates the hyperlink. The browser now thinks it should be putting out html: there follows an IMG link. The link references a URL - bogus - which is certain to cause an error for the browser, then provides a javascript sequence alert("i h4xx0red u") to handle it. It then restarts the original link for the benefit of the remainder of the page.

The browser thinks the javascript is valid, when in fact it was uploaded by the translation, not the website owner; the translation uploader - the attacker - can thus make the browser do what *he* wants, not what the website owner wants. Here I've made the javascript bring up a simple alert box, but in practice it could be anything the attacker who uploaded it wants. If he didn't want to be noticed, the visitor who receives malign javascript might never know.

The way to handle such a problem is to sanitise text before it gets rendered by the browser. The character "<", for instance, is converted to "&lt;" before rendering. At rendering, it's safely printed out as the character "<" again, rather than as the start of an html tag. This sanitising causes any html to be redescribed as plain text, thus neutering the attack.
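For HTML output, Python's standard library already provides this conversion. The wrapper name below is hypothetical and the attack string is only an illustration; other output contexts (JavaScript, URLs, postscript and so on) need their own escaping rules.

```python
import html


def sanitise_for_html(translation: str) -> str:
    """Escape <, >, & and quotes so the text renders as text, not markup."""
    return html.escape(translation, quote=True)


hostile = 'hello world</a><img src="bogus" onerror="alert(1)"><a href="#">'
print(sanitise_for_html(hostile))
```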

Relevance to csi18n service

There is no way to know from a translation whether it contains an XSS attack (which must be blocked) or merely describes an XSS attack (which should not be blocked.) Indeed, the service can't be sure whether a translation is destined for the browser which must be present to pull the attack off, vs - say - a postscript document.

Given this, the service makes no attempt to sanitise during upload, for storage, or when being returned to other people. It's the responsibility of the website owner / javascript programmer / developer to ensure that a returned translation has been safely sanitised for the environment being considered. (Exception: it's possible to directly browse the csi18n service - securing that limited case is my responsibility.)

tl;dr if you are putting a translation on your own website, or in your own app, or wherever - sanitise responses from this server before use!