10-Apr-2019
14:02 Partial loss of connectivity to several hosts in the S1 datacentre.
Servers were all reachable again in about 90 seconds.
Investigations are underway to find the cause and prevent any repeat.
25-Sep-2018
04:00 We have been advised of an "urgent works" requirement by the datacentre
and are expecting an outage of 30-60 minutes from 4am.
22-Jul-2018
09:06 Became aware that we have not been receiving RADIUS auth requests.
The problem was identified as the telco's RADIUS server, which died at 06:27
and went undetected by their monitoring. Service was restored at 09:58
after we spoke with the telco's engineers and they rebooted their server.
26-Apr-2018
18:00 We are upgrading some of our Melbourne infrastructure, and are
anticipating up to 60 minutes of server downtime for our mel-2
and mel-3 servers. These primarily provide quaternary nameserver,
backup monitoring and some VPN services. No impact is expected
to any customers.
05-Apr-2018
11:45 The shutdown of several servers planned for 11am was deferred
45 minutes because of unplanned access difficulties at the
datacentre. All hosting on the syd-5, dataretention and liveweatherviews
systems is anticipated to be down for 30 minutes while a new
storage subsystem is installed, requiring servers to be relocated
within the rack, and additional fibre interfaces to be installed.
ILO back 12:00
All other services back 12:12
23-Nov-2017
00:07 AAPT appear to have an aggregation switch failure, and half of our
upstream provider's network is down. They're working on it but no ETR.
All ADSL and NBN services affected.
(Resolved 01:00)
27-May-2017
10:00 Due to power infrastructure upgrades at our office, some
services (billing queries etc) will be offline for approx
1hr. (Work completed. System was offline 10:08-10:48am)
07-Apr-2017
10:12 Billing, accounts and enquiries, office phones, weather
and aviation data all offline. Construction workers 400m
away have put an excavator through the main street cable.
Waiting on repair crew.
(13:15. Repair crew still at least 3 hours away. Took
action myself: it's only a 30-pair cable, so I have done a
temporary repair to at least get back online)
09-Dec-2016
07:19 Some systems demonstrating degraded performance.
Within a few minutes, virtually all systems were non-responsive.
Half-million dollar SAN has crashed, and vendor-provided
configuration contains an error preventing failover to the
secondary unit! Rebuild under way, but it's taking forever.
Finally recovered and all systems back online at 15:39
30-Sep-2016
10:47 Packet loss on all external links. Under investigation.
(Partially Restored 11:02 but some paths still showing loss)
(Fully restored 12:00. Cause was a massive DDoS taking down all links)
01-Aug-2016
03:02 syd-1 server became unresponsive. Engineers called, services
restored 03:28. Investigation into cause to commence later today.
Some websites, most authentication, mail affected.
04-Jun-2016
10:45 All services to both datacentres are isolated. Engineers are
en route to sites now to see what has happened. No ETR.
(Restored 12:46, waiting on word as to the cause)
20-May-2016
18:30 We're declaring an urgent 30-minute maintenance window
for work on one of the blade servers.
Outage expected to be 10 minutes. Affected services will
include email, authentication and some web services.
Shutdown at 18:23, back online 18:43.
19-May-2016
06:37 Loss of connectivity to all servers in both datacentres in
Sydney. Engineers investigating.
Resolved and all services back 09:27. Awaiting full report.
25-Feb-2016
06:30 Due to a developing fault in a fibre-optic service, engineers
will be replacing a link at around 6:30am. All services, except
for our 3G mobile services, will be unavailable during this
time. Duration is expected to be less than 1 minute.
Work was completed ahead of time. Outage commenced 05:49:50
and full connectivity restored 05:50:30 (40 seconds outage)
13-Jan-2016
16:00 It is with some sadness that I report the decommissioning of our
longest-running server. "Starone" was built on 1-April-1997
and brought into service shortly after, where it has run 24/7
ever since. After starting to throw hard disk errors a while
ago, a replacement platform has been brought online to take
over all of its tasks. 18 years 9 months and 12 days.
02-Jan-2016
10:54 Works completed. New servers running, all tested services are
back online, no mail missed. Will continue to check everything
has migrated properly and monitor for any problems. In the event
users do encounter issues, please call 0409 578 660.
02-Jan-2016
10:30 A 30-minute maintenance window declared for migration of our
remaining servers from Global Switch datacentre at Ultimo to
our new facilities at S1.
15-Dec-2015
15:10 Notified by datacentre that critical switch fabric has failed, and
an emergency maintenance window was declared to replace it. All
services down at 15:12, restored approx 15:22. Not anticipating
any more issues. This is also the explanation for the brief blip
at 14:29 this afternoon.
15-Dec-2015
14:29 Brief loss of connectivity to all core servers. Investigating.
07-Sep-2015
17:21 Loss of Australian VoIP services. Upstream carrier appears to have
changed their infrastructure without advising wholesalers. After
finally getting someone there to identify the cause, it was a quick
fix to restore services. Most incoming calls redirected to mobiles
and all services restored by 18:45
22-Jul-2015
21:43 Intermittent problems with several of our servers over the last 2
days have been identified as a "perfect storm" confluence of three
events: 1. A controller failure in the SAN, resulting in degraded
IO performance; 2. An iSCSI driver problem in the new ESXi VM host,
causing further SAN problems; and 3. Loss of a blade server, causing
additional load on the remaining heads, exacerbating a bad situation.
Problems resolved by relocating all our hosts onto a brand new SAN
and new higher-performance servers. We are assured there should be
no further performance issues!
17-Jul-2015
09:02 SAN connectivity issues in the datacentre resulted in two servers
losing file systems and rebooting. Little impact to customers, some
webservers (CMS systems) affected for approx 12 minutes, cameras,
weather data etc inoperative for the same time.
06-Jul-2015
16:30 Mail, authentication and webservers lost connectivity.
Investigating root cause. VMs restarted, systems back online
at 16:40.
01-Jul-2015
10:00 Major disruption to infrastructure at our upstream. Claimed to be due
to the leap-second inserted at 23:59:59 UTC (i.e. 10AM local time),
causing fibre-optic links to lose timing integrity and shut down.
Required power-cycling and disabling of time sync at multiple sites.
Restored 11:25.
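The UTC-to-local mapping in the entry above can be checked with a short sketch. It assumes "local time" means AEST (a fixed UTC+10 offset, with no daylight saving in July). Because Python's datetime cannot represent a 23:59:60 leap second, it uses the first instant after it, 00:00:00 UTC on 1 July 2015. The variable names are illustrative only.

```python
from datetime import datetime, timedelta, timezone

# Assumption: "local time" in the log is AEST, a fixed UTC+10 offset
# (no daylight saving applies in July).
AEST = timezone(timedelta(hours=10), "AEST")

# datetime cannot hold second 60, so use the first instant after the
# leap second: midnight UTC at the start of 1 July 2015.
after_leap_utc = datetime(2015, 7, 1, 0, 0, 0, tzinfo=timezone.utc)
after_leap_local = after_leap_utc.astimezone(AEST)

print(after_leap_local.strftime("%Y-%m-%d %H:%M %Z"))
```

Under those assumptions the local time comes out as 10:00 on 1 July, which lines up with the 10:00 timestamp on this entry.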
18-Jun-2015
05:09 Connectivity loss at transit provider. Cause unknown, but their tech
people are working on it. No ETA yet. Will update as
I get anything. All our equipment is working; traffic is dying 1 or 2
hops into the carrier's network.
Restored 6:57am. Word back from the carrier is that they were doing
"network maintenance" that went seriously wrong. They had not issued
maintenance notice or hazard warnings because in their view it was
"inconsequential maintenance". It is unclear why both our primary and
backup links failed as they are supposed to be with different carriers
on different infrastructure.
19-May-2015
06:45 Connectivity to both datacentres lost! Completely unrelated to all the
other server migration works. Post-incident analysis shows that the
main route-server lost its mind at 6:45. The backup server should have
taken over but, as Murphy would have it, the backup server was away
being upgraded so that full seamless failover would work
between both sites! Primary server rebuilt, backup server will be back
in a couple of days and will be reinstalled then. (This server is not
part of our infrastructure and we were not advised it was being taken
out of service - an oversight that is being addressed!)
All services restored between 09:15 and 09:22
18-May-2015
20:27 ali-syd-4 has been relocated and decommissioned
$ uptime
8:26PM up 654 days, 6:41, 1 user, load averages: 0.01, 0.01, 0.00
17-May-2015
20:17 The second of this weekend's planned infrastructure upgrades has completed
without any evident problems. Syd2 has been migrated to our new server
platform at a different datacentre. This server runs (amongst other things)
most of our monitoring system, VoIP system and weather site.
Like syd3, it has been a faithful, reliable and hard working server.
[root@ALI-SYD-2 /]# uptime
08:40PM up 556 days, 20:45, 12 users, load averages: 1.20, 0.80, 0.78
16-May-2015
11:14 The first of this weekend's planned infrastructure upgrades has completed
without any evident problems. One of our oldest remaining servers (syd3)
has been migrated to our new server platform at a different datacentre.
It's been a faithful and reliable server, will be a pity to turn it off!
[root@ALI-SYD-3 /]# uptime
11:40AM up 821 days, 13:50, 24 users, load averages: 0.00, 0.00, 0.02
Services were stopped on old syd3 at 11:14:30, the final data sync
completed and the new server brought online at 11:27:05
12-May-2015
10:31 ProxyRadius server restored, authentication working again. Carrier is now
investigating why their failover system failed to work, and apologises for
any inconvenience.
12-May-2015
06:02 Our upstream provider has had a failure in part of their infrastructure,
resulting in them not passing us authentication requests. They are working
on it but have no ETR. This will affect any dial-up customer, and anyone
on ADSL whose modem drops offline and has to re-authenticate.
14-Apr-2015
20:35 Everything is back online and running. Problems getting other people into
the restricted-access areas of secure datacentres caused significant
delays. We have now identified the cause of these mysterious and intermittent
"non-responsive" periods and replacement hardware will be ordered tomorrow.
14-Apr-2015
16:42 Our main authentication/mail/web server has become non-responsive.
Support staff are all interstate, backup staff are being recalled to the
datacentre.
13-Feb-2015
08:36 Our main mail/authentication/web server has again become non-responsive.
System remotely restarted, back online 8:55. Suspect hardware identified
and replacements being arranged.
29-Jan-2015
08:48 Our main mail/authentication/web server has become non-responsive.
System restarted remotely and back online 8:53. Investigation underway.
19-Jan-2015
07:00 We are relocating our Melbourne rack to a different part of the datacentre.
This will result in approx 2hrs downtime. This will only affect tertiary
nameservers and some monitoring. Customer services, mail, authentication,
accounting, websites etc will be unaffected.
06-Jan-2015
18:22 Due to a failed power supply, the ALI offices are currently without
power or on very limited power. Some services may be unavailable
or will be intermittently available until further notice, but we
hope replacement equipment will be here within 24 hours.
Affected services will include on-line bill enquiries and payments,
weather, lightning, some skycam data and aircraft tracking systems.