Topic: Core Redundancy:What if? (Read 11980 times)

This thread is more of a poll: is this possible for me to achieve, or do you as Gurus (yes, you are) see any obstacles that I should be aware of before attempting it or even spending time on it? Maybe the majority of users feel that this is a really good killer feature that could be developed by someone more skilled (obviously I am not).

Having worked in telecoms for a decent number of years, I like the idea of redundancy. Is there any currently, or are there any plans for it, for the Core? I know that the MTBF for computer hw is fairly high, but still I cannot stop thinking about this. Of course there cannot be total redundancy of things like alarm sensors, phone lines etc., but...

Let me explain a bit more what is important for me: no rewriting kernels and patching (because that's way out of my league), but using the existing components even more (even that is a challenge).

LMCE is controlling the whole house (or actually we are, with the help of...). How cool is that? However, assume you have a proper UPS and that sort of easy redundancy sorted. What happens when something breaks inside the Core, like the CPU, HDD etc.? How does the alarm system react to a hw failure in the Core? Or what happens when the system locks up or freezes totally? Yeah, I know this is not a Windows machine, but it may happen. Typically the burglar will try to disarm your system, and nothing is impossible. I am not saying that LMCE should be used to guard the Pentagon or the White House, but for most people their belongings are equally important. Are you with me on what I am trying to depict?

I would like to know if it is possible to somehow easily configure the Core to have a close friend (standby). All config files would be stored externally (for example), and the other Core would be brought to life (from suspend, maybe) when the Master Core stops responding to whatever is used as a status check (echo reply, for example). The standby would then load the config from the external HDDs, and once it is up it would poll each device for its status and send a message to the orbiters (like the UE = mobile phone). It would also make sure the Master Core is really dead. Maybe a script in the Master Core could even reset the machine when a certain trigger is met and broadcast a message to the orbiters.
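To make the status-check idea a bit more concrete, here is a minimal sketch in Python of the decision logic a standby could run. Nothing here is part of LMCE: `FailoverMonitor`, `probe` and `max_misses` are made-up names, and `probe` stands in for whatever check is actually used (an echo reply, a TCP connect, and so on). It only shows the "several missed replies in a row before taking over" part, so that one lost ping does not start a second Core while the Master is still alive.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class FailoverMonitor:
    """Decides when a standby should take over from the Master Core (hypothetical)."""
    probe: Callable[[], bool]   # returns True if the Master answered the status check
    max_misses: int = 3         # consecutive failures required before takeover
    misses: int = 0
    active: bool = False        # True once the standby has taken over

    def tick(self) -> bool:
        """Run one status check; return True only on the tick that triggers takeover."""
        if self.active:
            return False
        if self.probe():
            self.misses = 0     # any good reply resets the counter
            return False
        self.misses += 1
        if self.misses >= self.max_misses:
            self.active = True
            return True
        return False


# Example: one good reply followed by three missed ones.
replies = iter([True, False, False, False])
monitor = FailoverMonitor(probe=lambda: next(replies))
print([monitor.tick() for _ in range(4)])  # [False, False, False, True]
```

Counting misses does not prove the Master is really dead (it may only be unreachable from the standby), so some extra confirmation before taking over would still be needed.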

Another unanswered question (in my head) is how the routing could be sorted. However, I am not looking for an HSRP/VRRP setup (even though it would be really nice) but something simpler.

To my understanding this should be possible to achieve in the current LMCE. Am I just being dumb or is it possible?

Sorry for all the text, but I want to make sure my idea is understood and not taken as an "I want this, please deliver me this" kind of thread, because I would like to contribute what I can, if possible of course (my time and skills are kind of show-stoppers though).

I'm interested in this as well. I tried Pluto awhile ago and I have been toying around with LMCE a little bit, but I am afraid to fully commit to the whole house solution because I haven't seen a way to allow redundancy. I am worried that I will devote full control of the house and then the core will die and the family will never let me forget the time when there was "no tv for a week."

On my network right now there is a database server that provides databases for bacula (backup solution), several mythtv boxes, zarafa (exchange type mail), etc., and it uses a RAID 5 setup with nightly backups to another RAID 5 file server. Is it possible to externalize the LMCE database so that, in the event of catastrophic hardware failure, I could bring another machine online and just tell it to use the same database? Or is there just too much going on behind the scenes to make this kind of thing work?

By the way, thanks to everyone that works on this project. I am truly excited about implementing a whole house solution eventually (hopefully sooner, rather than later).

BrJohan

The simple answer is... No redundancy at the level you are describing here.

However you can RAID your storage in a separate NAS and you can back that RAID'd data up off site too.

But none of that protects you from a motherboard failure in your Core, for example... if you get a major failure in the Core's hardware then that will bring your whole system down. Now you could have a 100% identical Core ready for that eventuality and have a duplicate of the boot drive ready to go... then if you get a failure you just power up the backup Core and you're back in business.

However I have to say that this approach is paranoid in the extreme ;-)

My current home Core has been up 24/7 for nearly 2 years now... so the MTBF is pretty good.

I know... Although Murphy (the guy who makes the impossible happen) is a close friend of mine, and I would actually assume that if I set up the Core with its security functions, it will go down when I'm on holiday with the family in a country far, far away. OK if only the TV or video sessions were disturbed, but when talking security we bring the system uptime to a different level. At least I expect the system to have some sort of parachute in case something goes wrong. Just a very hot summer day can send a "computer" into sleep mode. OK, I am not going to try to convince the skeptics about the positive side of node redundancy. I am strange, I know.

OK, I will start a small project of my own on this then. I will start with that RAID to NAS, beginning with the media folders (video, audio and pictures) to get something to base it on. That sounds like something that already exists in the wiki; I will search and read that. The next step will be to figure out how, and if, it is possible to have a system sleeping (for power consumption reasons) and still have a poll from the same node (sounds a bit contradictory, I know...). Then I will start experimenting with how to wake up a PC. I have seen it in the BIOS but never used it for real. I think there is even a wiki page for that regarding media directors. I don't see much more than that which has to be settled to make this work, actually, except for the hw cost and setup if going for full redundancy with PLCBUS control etc.
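For the "wake up a PC" step, the usual mechanism is Wake-on-LAN: the sleeping machine's network card listens for a "magic packet", which is 6 bytes of 0xFF followed by the target MAC address repeated 16 times. A minimal Python sketch follows, assuming Wake-on-LAN is enabled in the BIOS and on the NIC of the standby. The function names, the example MAC address and the choice of UDP port 9 are illustrative assumptions only, not anything taken from LMCE.

```python
import socket


def build_magic_packet(mac: str) -> bytes:
    """Build a Wake-on-LAN magic packet: 6 x 0xFF, then the MAC repeated 16 times."""
    hex_digits = mac.replace(":", "").replace("-", "")
    if len(hex_digits) != 12:
        raise ValueError(f"expected a 6-byte MAC address, got {mac!r}")
    mac_bytes = bytes.fromhex(hex_digits)  # raises ValueError on non-hex input
    return b"\xff" * 6 + mac_bytes * 16


def send_magic_packet(mac: str, broadcast: str = "255.255.255.255", port: int = 9) -> None:
    """Broadcast the magic packet on the local network (port 9 is a common choice)."""
    packet = build_magic_packet(mac)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
        sock.sendto(packet, (broadcast, port))


# Example with a made-up MAC address for the standby Core:
# send_magic_packet("00:11:22:33:44:55")
```

Whatever does the polling has to stay awake, of course, so the sender would have to be the running Core or some small always-on box rather than the sleeping standby itself.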

Going to give a try with a stdby core that has similar hw but only controls/monitors the security first though. Otherwise the $$$ will make this not so attractive.

Questions:

For LMCE, is there a sw architecture page that describes the architecture in more detail? Or is it common Linux/Kubuntu architecture/knowledge that applies? I would just like to find out which directories are vital for a Core to come up and which directories could be left out since they may be node specific (hw config etc.).

What happens with the alarm when everything is locked and the Core dies? Will it unlock itself, or stay armed?

Let's see how far I get in the limited "free" time given to a parent. I will do extensive research on the matter, and maybe it really is overkill in the extreme.

I suggest looking at the Programmers Guide on the wiki. It includes a description of the architecture.

It is a highly distributed messaging system that is very strongly typed (every possible command, event, and data are defined in the master database, are compiled into C++ code, and are used throughout the system), as such, while each piece is relatively small, the interaction of all the individual pieces is where the complexity comes into play.

While the individual devices can be distributed to other machines, there is always one DCERouter, and there are certain devices which are plugins, which must run in the DCERouter's memory space (because they need access to the others' data structures to intercept messages etc.)

The DCE devices themselves often wrap other pieces of software, exposing a command interface for the other DCE devices, and there is also a boatload of custom scripts underneath that deal with system configuration (well over 260 scripts at current count), so that this system can behave as an appliance.

So, as you can see, very complex, but in a good way. It does mean that research into this area is very much a long term venture.

Thank you. Yeah, I think I have grasped the complexity of the architecture, meaning I will not be able to learn what everything is doing in this lifetime.

So, just to make sure I understood you right: you basically mean that there is no (easy) way of setting up two nodes (Cores) the same way, keeping one in something like a hibernate state, and then making it take the active role when the main Core dies? Of course there will always be a difference in the hw (like MAC addresses and such), but excluding that and concentrating only on the security parts. Sorry for the newbie type questions...

But that's really helpful info since I would more or less waste my time in something that is not really going to work or maybe be more challenged to get it working

Will do some more reading on that page. How could I've missed that one(found it now)

The only "easy" way that springs to mind would be to run your core as dedicated (no MD) as a virtual machine with VMWare. There are multiple options with that platform for regularly taking exact snapshots of VMs, or sharing storage maybe using a shared SCSI bus or iSCSI, even VMotion if you're prepared to go to ESX.

That way you could setup 2 physically separate machines running VMWare and something to perform a heartbeat and trigger failover to the second piece of hardware when needed - as the image is identical, the rest of your LMCE need not know anything. There is even a level of virtualisation of switching/routing in VMWare that would allow the 2 NICs set up to failover as well.

Could be an expensive proposition though, as some of the advanced features are only available in the commercial versions of VMWare...

High availability with open source should be quite common. There are several projects and papers about this issue. I wrote my diploma thesis about that *siiggh*, where I compared three solutions for an HA framework. With redundancy in mind, it should be determined first what's important to keep alive. The Core? The DB? The filesystem? Both? How about the TV cards (keep in mind THE week w/o TV!)? The easiest way to increase HA is to go for RAID 1, hoping Murphy won't grill both HDs, so you can just keep a second "standby" Core, switch the HD over and rebuild the RAID 1.

Things like heartbeat etc. could do the trick as well. The "winner" of the comparison was http://www.openais.org, an open source implementation of the Service Availability Forum specifications, as the idea was to keep several independent software modules online. The company I worked for never implemented anything based on my thesis .... I wonder why ;O)

Gee, thanks... Well, as you say, it has to be decided what's important to protect and at what cost.

When it comes to the alarm, which was the main reason for concern, I am starting to doubt there is an actual need for redundancy. After all, if I decide to keep the normal way of opening a door with a key, the house is at least not open. I will definitely search further into the matter, but it seems closer to reality to start with RAIDed discs (discs have proven to have shorter lives) and also a good UPS system.

If Totallymaxed has had a Core running for two years (with regular maintenance, I presume) then there should at least be a possibility that mine will as well, if I just keep my hands off when everything is working.

Had my first freeze(small crash) yesterday and that one made me feel more safe with this sw.

After choosing a video (DVD), when it started I hit F7 and then menu (to skip the foreplay). Then, when going to the DVD options and choosing what I wanted, it simply closed the video and froze. As someone used to Windows I felt like, ohh! crap... now I need to reset it. But then I saw that the media director program closed and I came to the LMCE manager, I think it's called. Then after a few seconds the media director came back up again... That impressed me just a little bit (quite a lot, actually). To me it feels like there is at least some sort of code checking that all processes are up and, if not, restarting them. Wow!

Is there such a "documented" function that monitors the processes and restarts one if it crashes? Or was I just lucky?

In that case I feel my worry about the Core going into a Windows-style freeze is less likely to come true.

DCE itself has a watchdog process which watches every thread, and if a particular thread takes too long to execute (60 seconds), it will kill the router, and send a message to the launch manager to restart the DCERouter and the associated devices.

There are also other associated scripts for each of the major daemons, such as Asterisk, MythTV, etc.
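As a rough illustration of that watchdog idea (and not the actual DCE code, which is C++), here is a minimal Python sketch. Each worker calls `heartbeat()` when it makes progress, and a watchdog loop calls `stalled()` to find the workers that have been silent for longer than the timeout. The class and method names are made up for this example; only the 60 second figure comes from the description above.

```python
import time
from typing import Callable, Dict, List


class Watchdog:
    """Tracks the last sign of life from each named worker (hypothetical sketch)."""

    def __init__(self, timeout: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.timeout = timeout      # seconds a worker may stay silent
        self.clock = clock          # injectable so the logic can be tested
        self.last_seen: Dict[str, float] = {}

    def heartbeat(self, name: str) -> None:
        """Called by a worker whenever it finishes a unit of work."""
        self.last_seen[name] = self.clock()

    def stalled(self) -> List[str]:
        """Return the workers that have been silent for longer than the timeout."""
        now = self.clock()
        return sorted(name for name, seen in self.last_seen.items()
                      if now - seen > self.timeout)
```

What happens once `stalled()` returns something is a separate decision. As described above, DCE kills the router and has the launch manager restart it along with the associated devices.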

People really need to stop and thank Pluto for designing a system that was ultimately to be used as an appliance. The sheer amount of forethought that went into this system is nothing short of holy shit staggering.

-Thom

The problem with that, though, is that it brings the whole system to a halt while the router reload happens. In some situations this is a pain, e.g. I'm watching a movie and some device thread dies... and the watchdog decides to reload the router!... my movie playback gets killed for possibly 1-2 mins on a big complex system while the reload happens... then I have to manually restart my movie. Not really very nice at all.

Ideally we need to be able to resurrect a thread without having to restart the whole router to do it...
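Purely as an illustration of what "resurrect only the thread that died" could look like, here is a small Python sketch of a supervisor that restarts individual workers and leaves the others running. It is not how DCERouter works today, and `Supervisor`, `start` and `check` are hypothetical names. It also only covers threads that have exited; a thread that is hung rather than dead cannot simply be killed this way.

```python
import threading
from typing import Callable, Dict, List


class Supervisor:
    """Restarts individual workers that have exited, without touching the rest."""

    def __init__(self) -> None:
        self.targets: Dict[str, Callable[[], None]] = {}
        self.threads: Dict[str, threading.Thread] = {}
        self.restarts: Dict[str, int] = {}

    def start(self, name: str, target: Callable[[], None]) -> None:
        """Register a worker function under a name and launch it."""
        self.targets[name] = target
        self.restarts[name] = 0
        self._spawn(name)

    def _spawn(self, name: str) -> None:
        thread = threading.Thread(target=self.targets[name], name=name, daemon=True)
        self.threads[name] = thread
        thread.start()

    def check(self) -> List[str]:
        """Respawn every worker whose thread has ended; return the revived names."""
        revived = []
        for name, thread in list(self.threads.items()):
            if not thread.is_alive():
                self.restarts[name] += 1
                self._spawn(name)
                revived.append(name)
        return revived
```

With something like this, a crashed device thread would come back on the next `check()` while, say, the media playback worker keeps running untouched.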