
Our team has been working day and night to fix this, and I'm trying (as much as I am able!) to resist digging too deeply into the how and why until we are through the crisis and have all the systems restored.

Be assured, though, that we will thoroughly examine this incident, take a lot of lessons from it, and make the necessary changes.

We understand that our products have changed from 'needing to sync once in a while' to 'really needing to connect', and we are re-architecting both our infrastructure and our applications to reflect that, and to be more robust against failure. Some months ago we added complete offline backup/restore capability to Proclaim, so that it can run and even move presentations between systems without the Internet. Our desktop and mobile apps should work offline, except for online-only features (which Logos 6 introduced more of), and books that aren't downloaded.

(I've seen more reports than I expected of issues with the mobile app; the mobile apps should use any downloaded books just fine when there's no Internet -- I read in our mobile apps on the plane last week (to ETS and AAR/SBL conferences, where I am now) without any issues. But I know that iOS in particular can decide on its own to delete non-user-created data, like our downloaded ebooks, to manage space on the device. When we get through this we'll look carefully at the source of the mobile issues to see whether this caused the unexpectedly missing books (and if so, how we can prevent it), or whether something else is going on.)

We have recently made dramatic changes in our back-end systems, particularly in the area of storage management, in anticipation of Logos 6. We needed both more online storage (for an increased user count, the huge databases of map tiles for Atlas, the media-related features, etc.) and increased reliability, in light of the Internet-only nature of many new features.

Our team has long desired to have a completely mirrored back-end infrastructure, in which every server and every database is live-mirrored on a hot-switchable system. This, of course, requires essentially a 100% increase in systems cost, which runs into the hundreds of thousands of dollars. And it's still not a perfect system, because while it protects against hardware failures, it exactly replicates data/software problems to the secondary machine.

For a long time I resisted this, because the cost was extraordinary and we had high confidence that any failure could be addressed in just a few hours with replacement hardware (kept on hand) and/or restoring from backups. We felt the risk of a few hours of downtime wasn't worth the ongoing doubling of (already expensive) systems cost.

As our model has changed, though, we've realized that we need to be able to guarantee up-time and reliability, as many of you have correctly pointed out in forum posts this weekend. We also realized that our existing platform was prohibitively expensive to scale up to our future storage needs, and that our existing data center in Bellingham (professionally outfitted, but small by today's data center standards) wasn't enough to be our only solution.

So we designed a new architecture based on the latest trends in data center management, and using some new (to us) platforms. We also arranged for a secondary location south of Seattle (100 miles away from Bellingham) where we could run systems that we could quickly switch to. The secondary data center was set up, provisioned with hardware, and planned to be ready for the Logos 6 launch.

(Yes, we did discuss the idea of having the second data center even further away; a lot of discussion went into hosting it in Phoenix, near our Tempe office, so that any northwest catastrophe couldn't wipe out both centers. But there was some advantage to being able to have the team visit the second location in person at times, and it was hard to imagine a scenario in which Tukwila and Bellingham were both catastrophically destroyed and unrecoverable for a long period of time.)

All that remained was the high-bandwidth data link which would allow these two data centers to be completely and directly peered, without the same capacity constraints that were already an issue with being in the smaller Bellingham data center. We've been waiting for weeks for this data link to go live (it needs to be connected by a third party), and this weekend's failure happened before it did.

In theory, even without the secondary data center we should have been able to recover in hours. We employ technologies that should report drive and component failures, and which support hot-swapping of components, spares of which we keep on hand.

The report I have is that the tiered configuration of clusters -- clusters of drives represented as one drive to a node, clusters of nodes reported as one system to a higher-level system, etc. -- obscured a low-level error. A component at the bottom failed, and then two others, but the redundancy built into the system masked the small failures: at the higher level the whole system looked fine when the first small piece failed, and it didn't report the low-level failure. (This may be a mistake in our architecture, or in our configuration of the reporting software.) Then other components failed before we knew to replace the first one, and 'critical mass' was reached.
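To make the failure mode concrete, here is a minimal sketch -- my own illustration, not Faithlife's actual monitoring code, and the function names `rollup_masked` and `rollup_degraded` are hypothetical -- of how a redundancy-aware health roll-up can keep reporting "healthy" at the top tier while a component at the bottom has already failed:

```python
# Hypothetical sketch of tiered health roll-ups. Each tier is a list of
# booleans (True = member up). A roll-up that only asks "can this tier
# still serve data?" silently absorbs failures below the redundancy
# threshold -- the top level never hears about the first failed drive.

def rollup_masked(tiers):
    """Naive roll-up: a tier is 'healthy' if at least one member is up,
    so single-component failures never surface at the top level."""
    return all(any(member for member in tier) for tier in tiers)

def rollup_degraded(tiers):
    """Safer roll-up: report 'degraded' as soon as ANY member fails,
    even though the system as a whole is still serving data."""
    status = "healthy"
    for tier in tiers:
        if not any(tier):
            return "failed"   # every member of this tier is down
        if not all(tier):
            status = "degraded"  # redundancy absorbed a failure
    return status
```

With one drive of three down, `rollup_masked` still reports the whole system healthy, while the degraded-aware roll-up surfaces the failure while there is still time to replace the component -- before a second and third failure push the cluster past critical mass.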

This weekend the goal has been to get back up and running. Our team worked very hard to ensure that Proclaim was live for the majority of users. The fact that the Tukwila data center was only 100 miles away meant we were able to drive there on a Saturday, fetch the redundant equipment, and start deploying new hardware quickly. Another team member drove to a Seattle computer store and bought every hard drive in stock so we could build new storage arrays. The team has been going on very little sleep and working hard to restore everything. I'm told there's no data loss -- just a deployment problem.

(There are some technical details I'm still fuzzy on here, but apparently another issue relates to the number and size of data objects we have, and the fact that it takes longer than we anticipated to move the huge amount of data we now manage around when you swap new storage components into the system.)
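As a rough illustration of why those moves take so long, here is a back-of-envelope estimate; the helper name `rebalance_hours` and all the numbers are my own assumptions, not Faithlife's figures. Bulk transfer time is bounded by sustained throughput, and a large count of small objects adds per-object handling overhead on top:

```python
# Back-of-envelope sketch of storage-rebalance time. All numbers are
# illustrative assumptions, not Faithlife's actual figures.

def rebalance_hours(total_tb, throughput_mb_s, objects=0, per_object_ms=0.0):
    """Hours to migrate `total_tb` of data at a sustained throughput of
    `throughput_mb_s` MB/s, plus a fixed per-object handling cost."""
    transfer_s = (total_tb * 1024 * 1024) / throughput_mb_s  # TB -> MB
    overhead_s = objects * per_object_ms / 1000.0
    return (transfer_s + overhead_s) / 3600.0

# Moving 50 TB at a sustained 200 MB/s is roughly three days of pure
# transfer time, before any per-object overhead.
bulk = rebalance_hours(50, 200)

# Add 500 million small objects at 0.5 ms of handling each and the
# estimate nearly doubles.
with_objects = rebalance_hours(50, 200, objects=500_000_000, per_object_ms=0.5)
```

The second estimate is the interesting one: even when raw throughput is fine, a storage pool made of many small objects pays a fixed cost per object, which is easy to underestimate when planning how quickly a replacement array can be brought into service.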

I'm providing a very high-level overview of the problem, and I am simplifying the description and providing it as I understand it right now. I could be describing it wrong, and once we know more about what happened we'll post a better explanation. (For example, I'm told that the storage system was back up relatively soon, but despite reporting health simply wasn't performing as it should. We don't yet know why.)

Ultimately the fault is mine.

Whatever the technical investigation reveals, I am ultimately responsible to deliver the products and services you rely on, and I let you down.

Bob, thank you for your transparency regarding this significant issue. I appreciate that you "own" this and am confident that you and your team will do what it takes to learn from this experience and continue to make future products even better. May the Lord give you the wisdom and strength you need for the days ahead.


Bob, Faithlife is an awesome corporation, and I still feel that way. Thanks for the great explanation and keep up the good work. I know that others were probably affected more than I was, but it felt kind of good to read from my paper Bible rather than the iPad for a couple of days.

Peace

Romans 14:19 (NRSV): Let us then pursue what makes for peace and for mutual upbuilding.

Thanks for the update. I work in IT myself, and I know full well the potential system and hardware failures that can happen. We do the best we can, and there's always something we can do better.

Our church is poor (space rented from another church, no internet, etc.), so we don't run Proclaim (just PowerPoint from my own laptop). I always made sure everything could run offline; I use a tablet for studies and reading, but I don't preach from it.

Look forward to everything being back online again. Keep up the good work. My prayers are with you and all the IT people hard at work getting things back to normal.

Thanks for the info. I have been using Logos since Libronix days, and I do not remember any problem like this before. You all have outstanding customer service! I would like to see people walk in love. Does Christ condemn? I think not! Thanks for all of your and your amazing staff's hard work! God bless!


All that remained was the high-bandwidth data link which would allow these two data centers to be completely and directly peered, without the same capacity constraints that were already an issue with being in the smaller Bellingham data center. We've been waiting for weeks for this data link to go live (it needs to be connected by a third party), and this weekend's failure happened before it did.

I don't think you need to peer everything in real-time. There may be cost-saving alternatives.

I was a qualified reactor operator in the US Navy for many years. Our systems were robust and redundant, and they were supposed to tell us what was happening -- yet the systems were also extremely complex. A simple failure could often lead to complete shutdowns that were extremely difficult to diagnose and correct.

In other words, life isn't perfect, and when you deal with large, complex systems, it will often bite you. Thanks for all of the hard work to get it back up and running as quickly as possible.

For the last seven years of my career with the Federal Aviation Administration, I worked with automation systems in air traffic control. As important as air traffic control is, we had system failures there just like anywhere else. We tried to learn from them, did our very best to make sure they didn't happen again, and moved on.

Thanks for being willing to take responsibility for this, Bob, but please don't beat yourself up too much over it. It is something to learn from, and it will show you ways to improve your systems. Yes, some people were inconvenienced, but this is just something that happens to any technology company from time to time, whether we are talking about Microsoft, Apple, or anyone else.

Thanks for being willing to come on the forums like you do, be upfront with your customers, and for building great Bible software!

Bob, thanks for speaking to us and being so transparent at this tough time. I appreciate your honesty and hard work and will continue to pray for all of you as I remain an ardent user of Logos. God bless you and your team as you continue your work of completely restoring the whole system.

Thank you, Bob, for the explanation. As I said elsewhere, this was an inconvenience, not a disaster. Fortunately for me, other than a delay in getting a free book, Verbum 6 functioned perfectly (since I did not wish to make a visual copy or use the atlas, this blip could have gone unnoticed had I not been a user of the forums). I am glad you are getting things back up to 100% and going ahead with the strategy that should make this event a one-time occurrence. I know this incident just happened a few weeks too soon; Murphy's Law works that way. I remember over a decade ago, I had my Mac backed up except for my iTunes library. I had bought the CD-ROMs to back it up and even reorganized my music into 750 MB folders, but the day before I was going to do it, my hard drive died. It was an extreme pain redoing all my MP3s, but I also made sure I got a bigger backup hard drive so I always have a few backups of my machine. God bless you and all your employees.
