I've been setting up continuous deployment recently for an application I'm working on, and as part of this process I'm uploading the release with sftp, using a restricted user account that is both chrooted (though I use a subfolder of the home directory to be extra-sure) and doesn't have shell access.

Since the application is written in PHP, I use composer to manage the server-side PHP library dependencies - which works very well. The problems start when I try to upload the whole thing to the server - so I thought I'd make a quick post here on how I fixed it.

In a previous build step, I generate an archive for the release, and put it in the continuous integration (CI) archive folder.

In the deployment phase, it unpacks this compressed archive and then uploads it to the production server with lftp, because I need to do some fiddling about that I can't do with regular sftp (anyone up for a tutorial on this? I'd be happy to write a few posts on this). However, I kept getting this weird error in the CI logs:

Very strange indeed! Apparently, lftp isn't known for outputting especially useful error messages when used in an automated script like this. I tried everything. I rewrote, refactored, and completely turned the whole thing upside-down multiple times. This, as you might have guessed, took quite a while.

Commits aside, it was only when I refactored it to do the upload via the regular sftp command like this that it became apparent what the problem was:

The last line there instantly told me what I needed to know: It was failing to upload a symbolic link.

The solution here was simple: convert the symbolic links into hard links instead. That way I still get the benefit of a link on the local disk, but sftp treats each one as a regular file and uploads a copy.
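
Since the deployment script is written in Bash (more on that below), something along these lines does the trick. This is a sketch of the idea rather than the exact script I use - release/ is just a placeholder for the unpacked release directory, and bear in mind that hard links only work when the link and its target live on the same filesystem:

# Replace every symbolic link under the release directory with a hard link
# to its target, so sftp uploads a real file instead of failing on the link.
find release/ -type l | while read -r link; do
    target="$(readlink -f "$link")" # resolve the symlink to its real target
    rm "$link"                      # remove the symlink itself...
    ln "$target" "$link"            # ...and recreate it as a hard link
done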

If you'd like to see the full deployment script I've written, you can find it here.

There's actually quite a bit of context to how I ended up encountering this problem in the first place - which includes things like CI servers, no small amount of bash scripting, git servers, and remote deployment.

In the future, I'd like to make a few posts about the exploration I've been doing in these areas - perhaps along the lines of "how did we get here?", as I think they'd make for interesting reading.....

I've recently been setting up dotnet on my Artix Linux laptop for my course at University. While I'm unsure precisely what dotnet is intended to do (and how it's different to Mono), my current understanding is that it's an implementation of .NET Core intended for developing and running ASP.NET web applications (there might be more on ASP.NET in a later 'first impressions' post soon-ish).

While the distribution I use is somewhat esoteric (it's based on Arch Linux), I've run into a number of issues with the installation process and getting Monodevelop to detect it - and if what I've read whilst researching said issues is anything to go by, they aren't confined to a single operating system.

Since I haven't been able to find any concrete instructions on how to troubleshoot the installation for the specific issues I've been facing, I thought I'd blog about it to help others out.

Installation on Arch-based distributions is actually pretty easy. I did this:

sudo pacman -S dotnet-sdk

Easy!

Monodevelop + dotnet = headache?

After this, I tried opening Monodevelop - and found an ominous message saying something along the lines of ".NET Core SDK 2.2 is not installed". Strange. If I try dotnet in the terminal, I get something like this:

Turns out that it's a known bug. Sadly, there doesn't appear to be much interest in fixing it - and neither does there appear to be much information about how Monodevelop does actually detect a dotnet installation.

Thankfully, I've deciphered the bug report and done all the work for you :P The bug report appears to suggest that Monodevelop expects dotnet to be installed to the directory /usr/share/dotnet. My system didn't install it there, so I went looking to find where it did install it to. Doing this:

whereis dotnet

Yielded just /usr/bin/dotnet. My first thought was that this was a symbolic link to the binary in the actual install directory, so I tried this to see:

ls -l /usr/bin/dotnet

Sadly, it was not to be. Instead of a symbolic link, I found what appeared to be the binary itself - which could also have been a hard link. Not to be outdone, I tried a more brute-force approach to find it:

sudo find / -mount -type d -name "dotnet"

Success! This gave a list of all directories on my main / root partition that are called dotnet. From there, it was easy to pick out that it actually installed it to /opt/dotnet.

Instead of moving it from the installation directory and potentially breaking my package manager, I instead opted to create a new symbolic link:

sudo ln -s /opt/dotnet /usr/share/dotnet

This fixed the issue, allowing Monodevelop to correctly detect my installation of dotnet.

Templates

Thinking my problems were over, I went to create a new dotnet project following a tutorial. Unfortunately, I ran into a number of awkward and random errors - some of which kept changing from run to run!

I created the project with the dotnet new subcommand like this:

dotnet new --auth individual mvc

Apparently, the template projects generated by the dotnet new subcommand are horribly broken. To this end, I re-created my project through Monodevelop with its inbuilt templates instead - and was met with considerably more success than I was with dotnet new.

HTTPS errors

The last issue I've run into is a large number of errors relating to the support for HTTPS that's built-in to the dotnet SDK.

Unfortunately, I haven't been able to resolve these. To this end, I disabled HTTPS support. Although this sounds like a bad idea, my reasoning is that in production, I would always have the application server itself run plain-old HTTP - and put it behind a reverse-proxy like Nginx that provides HTTPS, as this separates concerns. It also allows me to have just a single place that implements HTTPS support - and a single place that I have to constantly tweak and update to keep the TLS configuration secure.

To this end, there are 2 things you've got to do to disable HTTPS support. Firstly, in the file Startup.cs, find and comment out the following line:

app.UseHttpsRedirection();

In a production environment, you'll probably have your reverse-proxy configured to do this HTTP to HTTPS redirection anyway - another instance of separating concerns.

The other thing to do is to alter the endpoint and protocol that it listens on. Right click on the project name in the solution pane, click "Options", then "Run -> Configurations -> Default", then the "ASP.NET Core" tab, and remove the s in https in the "App URL" box like this:

By the looks of things, you'll have to do this 2nd step on every machine you develop on - unless you also untick the "user-specific" box (careful you don't include any passwords etc. in the environment variables in the opposite tab in that case).

You may wish to consider creating a new configuration that has HTTPS disabled if you want to avoid changing the default configuration.

Found this useful? Got a related issue you've managed to fix? Comment below!

For one reason or another I found myself a few days ago inspecting the code behind Pepperminty Wiki's full-text search engine. What I found was interesting enough that I thought I'd blog about it.

Forget about that kind of Search Engine Optimisation (the horrible click-baity kind - if there's enough interest I'll blog about my thoughts there too) and cue the appropriate music - we're going on a field trip fraught with the perils of Unicode, page ids, transliteration, and more!

Firstly, I should probably mention a little about the starting point. The (personal) wiki that is exhibiting the slowness has ~75K words spread across 546 pages. Pepperminty Wiki manages to search all of this in about 2.8 seconds by way of an inverted index. If you haven't read my last post, you should do so now - it sets the stage for this one - and you'll be rather confused otherwise.

2.8 seconds is far too slow though. Let's do something about it! In order to do something about it, there are several other things that need explaining before I can show you what I did to optimise it. Let's look at Pepperminty Wiki's search system first. It's best explained with the aid of a diagram:

In short, every page has a numerical id, which is tracked by the ids core class. The search system interacts with this during the indexing phase (that's a topic for another blog post) and during the query phase. The query phase works something like this:

The inverted index is loaded from disk (in my personal wiki the inverted index is ~968k, and loads in ~128ms).

The inverted index is queried for pages that match the tokenised query terms.

The results returned from the query are ranked and sorted before being returned.

A context is extracted from the source of each page in the results returned - just like Duck Duck Go or Google have a bit of text below the title of each result

Said context has the search terms highlighted.

It sounds complicated, but it really isn't. The complicated bit comes when I tried to optimise it. To start with, I analysed how long each of the above steps were taking. The results were quite surprising:

Step #1 took ~128ms on average

Steps #2 & #3 took ~1200ms on average

Step #4 took ~1500ms on average(!)

Step #5 took a negligible amount of time

I did this by setting headers on the returned page. Timing things in PHP is relatively easy:
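
The timing code itself isn't reproduced here, but the general pattern is just a pair of microtime() calls with the result pushed out as a response header. In this sketch, the header name and the do_query() call are placeholders of mine, not Pepperminty Wiki's actual names:

$start = microtime(true);          // high-resolution timestamp, in seconds
$results = do_query($query);       // the step being measured (placeholder)
$duration = round((microtime(true) - $start) * 1000, 2);
header("x-time-query: {$duration}ms"); // surfaces the timing in the browser dev tools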

This gave me a general idea as to what needed attention. I was surprised to learn that the context extractor was taking most of the time. At first, I thought that my weird and probably inefficient algorithm was to blame. There's no way it should be taking 1500ms!

So I set to work rewriting it to make it more optimal. Firstly, I tried something like this. Instead of multiple sub-loops, I figured out a way to do it with just 1 for loop and a few calls to mb_stripos_all().
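
For context, mb_stripos_all() isn't a built-in PHP function - it's a little helper. I haven't included the rewritten extractor itself, but a sketch of what such a helper might look like (my approximation, not necessarily the real one) is:

// Return every case-insensitive offset of $needle within $haystack.
function mb_stripos_all(string $haystack, string $needle) : array {
    $offsets = [];
    $pos = 0;
    while (($pos = mb_stripos($haystack, $needle, $pos)) !== false) {
        $offsets[] = $pos;
        $pos += mb_strlen($needle); // continue searching after this match
    }
    return $offsets;
}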

Unfortunately, this did not have the desired effect. While it did shave about 50ms off the total time, it was far from what I'd hoped for. I tried refactoring it slightly again to use preg_match_all(), but it still didn't give me the speed boost I was after - only another 50ms or so.

To get some answers, I brought out the big guns and profiled it with XDebug.

Upon analysing the generated profile it immediately became clear what the issue was: transliteration. Transliteration is the process of removing the diacritics and other accents from a string to make it easier to compare with other strings. For example, café becomes cafe. In PHP this process is a bit funky. Here's what I do in Pepperminty Wiki:
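
The exact function isn't shown here, but the gist of it - using PHP's intl extension - can be sketched like so. This is my approximation of the approach rather than a copy of Pepperminty Wiki's code:

// Decompose accented characters, strip the combining marks, then case-fold.
$text = Normalizer::normalize($text, Normalizer::FORM_KD); // é → e + U+0301
$text = preg_replace('/\p{Mn}+/u', '', $text);             // drop the combining marks
$text = mb_strtolower($text);                              // normalise the case too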

In short, it preprocesses a chunk of text so that it can be easily used by the search system. In my case, I transliterate search queries before tokenising them, source texts before indexing them, and crucially: source texts before extracting contextual information.

The thing about this wonderful transliteration process is that, at least in PHP, it's really slow. Thinking about it, the answer was obvious: why bother extracting offset information from the source text at all when the inverted index already contains that information?

The answer is: you don't. Upon refactoring the context extractor to utilise the inverted index, I managed to get it down to just ~59ms. Success!
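
The refactored extractor is a fair bit longer than this in reality, but the core idea - reusing the offsets the inverted index already stores instead of re-scanning (and re-transliterating) the whole source text - looks roughly like this. Function and parameter names here are illustrative, not the real API:

// Build a context snippet from the match offsets stored in the inverted index.
function extract_context(string $source, array $offsets, int $radius = 100) : string {
    $snippets = [];
    foreach (array_slice($offsets, 0, 3) as $offset) { // a few matches is plenty
        $start = max(0, $offset - $radius);
        $snippets[] = mb_substr($source, $start, $radius * 2); // text around the match
    }
    return implode(" … ", $snippets);
}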

Next up was the query system itself. 1200ms seems a bit high, so while I was at it, I analysed a profile of that as well. It turned out that a similar problem was occurring here too. Surprisingly, the page id system's getid($pagename) function was being really slow. I found 2 issues here.

Firstly, I was doing too much Unicode normalisation. In the page id system, I don't want to transliterate to remove diacritics, but I do want to make sure that all the diacritics and accents are represented in the same way.

If you didn't know, Unicode has both a single character for letters like é (e-acute), and a combining code-point for the acute accent itself, which gets merged into the previous letter during rendering. This can cause a page to acquire 2 (or even more!) seemingly identical ids in the system, which caused me a few headaches in the past! If you'd like to learn more, the article on Unicode normalisation I linked to above explains it in some detail. Thankfully, the solution is quite simple. Here's what Pepperminty Wiki does:

Normalizer::normalize($string, Normalizer::FORM_C)

This ensures that all accents and other weird characters are represented in the same way. As you might guess though, it's slow. I found that in the getid() function I was normalising both the page names I was iterating over in the index and the target page name to find - on every iteration of the loop. The solution here was simple:

Don't normalise the page names from the index - that's the job of the assign() protected method to ensure that they are suitably normalised when adding them in the first place

Normalise the target page name only once, and then use that when spinning through the index (see the sketch below).
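
Put together, the optimised lookup looks roughly like this - a sketch of the idea rather than Pepperminty Wiki's exact code (the $this->ids property name and the -1 "not found" value are illustrative):

public function getid(string $pagename) : int {
    // Normalise the target name just once, outside the loop.
    $target = Normalizer::normalize($pagename, Normalizer::FORM_C);
    foreach ($this->ids as $id => $name) {
        // The stored names were already normalised by assign(),
        // so a plain comparison is all that's needed here.
        if ($name === $target) return $id;
    }
    return -1; // not found
}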

Implementing these simple changes brought the overall search time down to 700ms. The other thing to note here is the structure of the index. I show it in the diagram, but here it is again:

1: Booster

2: Rocket

3: Satellite

The index is basically a hash-table mapping numerical ids to their page names. This is great for when you have an id and want to know what the name of the page associated with it is, but terrible for when you want to go in the other direction, as we need to do when performing a query!

I haven't quite decided what to do about this. Obviously, the implications on efficiency are significant whenever we need to convert a page name into its respective numerical id. The problem lies in the fact that the search query system travels in both directions: It needs to convert page ids into page names when unravelling the results from the inverted index, but it also needs to convert page names into their respective ids when searching the titles and tags in the page index (the index that contains information about all the pages on a wiki - not pictured in the diagram above).

I have several options that I can see immediately:

Maintain 2 indexes: one going in each direction (see the sketch after this list). This would also bring a minor improvement to indexing new and updating existing content in the inverted index.

Use some fancy footwork to refactor the search query system to unwind the page ids into their respective page names before we search the pages' titles and tags.
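
To illustrate the first option, the reverse lookup could be as simple as keeping a flipped copy of the map - again, just a sketch of the idea, not something Pepperminty Wiki actually does yet:

// The existing index: numerical id → page name.
$ids_to_names = [ 1 => "Booster", 2 => "Rocket", 3 => "Satellite" ];
// A second index going the other way, rebuilt whenever a page is added or renamed.
$names_to_ids = array_flip($ids_to_names);
$id = $names_to_ids["Rocket"] ?? -1; // O(1) name → id lookup instead of a linear scan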

While deciding what to do, I did manage to reduce the number of times I convert a page name into its respective id by only performing the conversion if I find a match in the page metadata. This brought the average search time down to ~455ms, which is perfectly fine for my needs at the moment.

In the future, I may come back to this and optimise it further - but as it stands I'm getting to the point of diminishing returns: Where every additional optimisation requires twice the amount of time to implement as the last, and only provides a marginal gain in speed.

To this end, it doesn't seem worth it to spend ages tackling this issue now. Pepperminty Wiki is written in such a way that I can come back later and turn the inner workings of any part of the system upside-down, and it doesn't have any effect on the rest of the system (most of the time, anyway.... :P).

If you do find the search system too slow after these optimisations are released in v0.17, I'd like to hear about it! Please open an issue and I'll investigate further - I certainly haven't reached the end of this particular lollipop.

Found this interesting? Learnt something? Got a better way of doing it? Comment below!

I was processing some images for someone recently, and I ended up encountering issues with colour balance. The images looked okay on my monitor, but as soon as I printed them out, they took on a slight red-orange tint. Very interesting. I suspect that the root cause lies in some complex colourspace or device colour profile issue (which will take me ages to debug and track down), but I stumbled upon a filter in GIMP called Retinex, which provided a very useful workaround.

According to the GIMP documentation, retinex is an algorithm that improves the appearance of images that were taken in sub-optimal lighting conditions. It's probably best illustrated with an example:

As you can see, the things on the desk are much easier to pick out in the right image as compared to the left one. Apparently, the algorithm was invented at NASA's Langley Research Centre in 2004 to automatically enhance astronomical photographs - and has a full name of Multi-Scale Retinex with Color Restoration (MSRCR) - which is a bit of a mouthful!

During my own testing, I've found it to be most effective on outdoor pictures, or pictures with poor lighting. I've also found it to be rather prone to introducing noise into the image - so if a simple automatic white balance correction will suffice, then that's probably a better filter to apply than this one.

It's one of those things that's really useful to know about - because it might just solve your problem one day! To that end, I wanted to blog about it so that I don't forget :P