BGP route-reflectors and MPLS - suboptimal path.

We have 4 routers physically connected in a ring; three of them have an eBGP session with an upstream ISP.

The two RC-RR routers are route reflectors, and all other routers have BGP sessions with them using their loopback IPs as source.

Because of speed and price, the connection RC-E001 <--> RC-RR1 is a backup, and the OSPF and BGP metrics are set accordingly.

The internal routing is working as expected. All routers are MPLS "P" routers, but only loopback IPs are label-switched. This means that only traffic to a loopback follows the label path; other traffic should use the normal routing table.

The problem is the following: traffic to the Internet from router RC-E001 follows the path RC-E002 ---> RC-RR2 ---> RC-RR1,

but it should just go to router RC-E002 and then directly to the Internet. All external prefixes on RC-E001 have RC-RR1 as the next hop (higher local-preference).

Traceroute on RC-E001 shows following:

RC-E001#traceroute 8.8.8.8

1 RC-E002 [MPLS: Label 202 Exp 0] 16 msec 20 msec 60 msec

2 RC-RR2 [MPLS: Label 79 Exp 0] 20 msec 16 msec 20 msec

3 RC-RR1 [AS UPSTREAM] 20 msec 16 msec 20 msec

4 UPSTREAM [AS UPSTREAM] 20 msec 16 msec 20 msec

5 ....

I understand that RC-E001 tries to reach the BGP next hop via the MPLS label path, because all loopbacks are label-switched, but I don't want the traffic to take such a sub-optimal path.

What have I configured wrong, and what should I do to force the traffic from RC-E001 to go out directly via RC-E002?

Replies

The attachment does not show a complete diagram, just a partial arrow. Can you please provide an updated diagram so we can better understand the network topology?

Are you using NHS on the RRs? By default NHS does not take effect for reflected routes on an RR even when it is configured. If RC-E002 also has an upstream ISP peering, like RC-RR1 and RC-RR2, then for RC-E001 to choose RC-E002, the BGP attributes of the routes injected by RC-E002 have to be better than those of RC-RR1/RR2, so that RC-RR1/RR2 choose RC-E002's route as best and do not announce their own routes.

If I understand correctly from the topology depicted above, RC-E002 is also an RR client. Am I correct? RC-E001 only peers with the RRs RC-RR1 and RC-RR2, right?

In order for RC-E001 to prefer RC-E002 as the exit point for Internet traffic, we need to make RC-E002's BGP routes more preferred than RC-RR1's, so that RC-RR1 reflects RC-E002's routes to RC-E001 and does not advertise its own routes. This can be achieved by increasing the LP of RC-E002's routes.

The problem here is that RC-E002 is not the best candidate for Internet traffic; instead RC-RR1 is advertising the best routes. The LSP only comes into play second, after the best route has been selected, as the way to reach that next hop. Even if we peer RC-E002 directly with RC-E001, it will not solve the issue, as the default LP for routes learnt from RC-E002 will be 100, whereas the LP of routes learnt from RC-RR1 will be 150, which is better, so RC-RR1's routes are still selected for Internet traffic.
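To make the local-preference argument concrete, here is a minimal Python sketch of a simplified subset of the BGP decision process (weight, then local preference, then AS-path length, then eBGP over iBGP, then IGP metric to the next hop). The `Path` class, the AS-path length and the IGP metrics are illustrative assumptions, not values from the real routers; only the LP values 100 and 150 come from this thread.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Path:
    next_hop: str
    weight: int = 0
    local_pref: int = 100   # default local preference
    as_path_len: int = 3    # assumed, identical for all paths here
    is_ebgp: bool = False
    igp_metric: int = 0     # assumed IGP cost to reach next_hop


def best_path(paths):
    """Pick the winner using a simplified subset of the BGP decision steps."""
    return max(
        paths,
        key=lambda p: (
            p.weight,        # 1. highest weight
            p.local_pref,    # 2. highest local preference
            -p.as_path_len,  # 3. shortest AS path
            p.is_ebgp,       # 4. eBGP preferred over iBGP
            -p.igp_metric,   # 5. lowest IGP metric to the next hop
        ),
    )


via_rr1 = Path(next_hop="RC-RR1", local_pref=150, igp_metric=30)
via_e002 = Path(next_hop="RC-E002", local_pref=100, igp_metric=10)
via_e002_raised = Path(next_hop="RC-E002", local_pref=200, igp_metric=10)

print(best_path([via_rr1, via_e002]).next_hop)          # LP 150 beats LP 100
print(best_path([via_rr1, via_e002_raised]).next_hop)   # LP 200 beats LP 150
```

Note that the IGP metric only matters once everything above it ties, which is why a closer exit does not win on its own while the local preferences differ.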

If there were no MPLS in network, then traffic would be routed hop-by-hop.

This means RC-E001 would still see RC-RR1 as the next hop for all external prefixes, but on the way there the traffic would be routed by RC-E002, and RC-E002 would simply send the external traffic directly to its upstream. The LP doesn't play any role here, because RC-E001 has no BGP session with RC-E002 and doesn't get any prefixes from it.
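The difference between the two forwarding modes can be sketched in a few lines of Python. This is only a toy model: the two tables are assumptions reconstructed from the traceroute and the description above, and `UPSTREAM` is a placeholder for each router's own eBGP exit.

```python
# IGP next hop per router toward a given loopback (assumed from the traceroute).
IGP_NEXT_HOP = {
    "RC-E001": {"RC-RR1": "RC-E002"},
    "RC-E002": {"RC-RR1": "RC-RR2"},
    "RC-RR2": {"RC-RR1": "RC-RR1"},
}

# BGP next hop each router uses for Internet prefixes.
# "UPSTREAM" means the router exits over its own eBGP session.
BGP_NEXT_HOP = {
    "RC-E001": "RC-RR1",
    "RC-E002": "UPSTREAM",
    "RC-RR2": "UPSTREAM",
    "RC-RR1": "UPSTREAM",
}


def ip_hop_by_hop(source):
    """Every router does its own IP lookup, so the first router with an exit wins."""
    path, router = [source], source
    while BGP_NEXT_HOP[router] != "UPSTREAM":
        router = IGP_NEXT_HOP[router][BGP_NEXT_HOP[router]]
        path.append(router)
    return path + ["UPSTREAM"]


def label_switched(source):
    """Only the ingress router does an IP lookup; transit routers swap labels."""
    path, router = [source], source
    egress = BGP_NEXT_HOP[source]
    if egress == "UPSTREAM":
        return path + ["UPSTREAM"]
    while router != egress:
        router = IGP_NEXT_HOP[router][egress]  # label lookup, no IP lookup
        path.append(router)
    # The egress router receives a plain IP packet and routes it itself.
    return path + label_switched(egress)[1:]


print(" -> ".join(ip_hop_by_hop("RC-E001")))
print(" -> ".join(label_switched("RC-E001")))
```

In this model the hop-by-hop path leaves at RC-E002, while the label-switched path is carried all the way to RC-RR1 before any router looks at the destination IP again.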

But you're right - if I set up a BGP session between RC-E001 and RC-E002, then I can set the LP accordingly. I should think about it, because it means a "small" change in the design. RC-E002 was not supposed to be an RR, but maybe it's a good idea to create a second level of RR sessions.

NHS stands for next hop self. If you are using a transit/border router as an RR, you might notice that some of your next hops get trampled; you can fix this manually with a route-map that sets the next hop, if you wish.

I can't see the diagram (using app) but I would provide good odds that this is it;

When the ingress LSR (the guy you are tracing from) does a lookup, it sees that the next hop recurses to a label switched path (LSP).

With MPLS forwarding the packet is not routed hop by hop. Actually the packet is put onto a predetermined path to the next hop router in this case.

When the packet arrives here the router sees an IP packet (PHP) and sends the packet on a new predetermined path towards the upstream.

Have a look at the next hop values, I'm confident you'll see this is causing the issue, as mentioned you can manipulate the next hop the RR is advertising to correct the issue.

You're right about the LSP: RC-E001 uses the LSP to reach the next hop, and that is why the external traffic goes to RC-RR1.

But how should I change the next-hop value - just set the next-hop IP to RC-E002? I don't think that's the best solution, because RC-E001 doesn't have a BGP session with RC-E002. It will work for sure, but it breaks the design rules.

I was thinking of configuring RC-E002 as a route reflector for RC-E001 and so creating an additional route-reflector level, because as I remember, according to BGP best practice, traffic from one route-reflector client should not pass through another route-reflector client.

I'm at home now on a PC and can see the diagrams. I think the issue is just that the best route is via RR1 (and not RR2), hence the traffic gets there based on the best OSPF path to its loop0 (and as already mentioned, MPLS sends the packet on a predetermined path).

You're asking how to make traffic leave your network to the internet on RR2, you have two options;

- don't use label switching (this may not be an option for you)

- change the BGP attributes so that the best route in the BGP table on E001 is the path via RR2

I think your current local preference scheme is what is causing the route via RR1 to be selected. You could choose to raise the local preference to above 150 for just the default route (assuming that's the route you're using) to resolve this.

Note: Varma did say the second solution above earlier

You also asked about hierarchical route-reflectors; I don't think this would help you solve the problem. You never mentioned whether there is an iBGP session between RR1 & RR2. Is there? As long as it is plain iBGP with no RR client configured, that should be fine, and each RR will learn the routes which the other RR learnt through eBGP (that sounds confusing but I'm sure you know what I mean).

If you don't think that's correct or doesn't solve the problem can you post some "show ip bgp 0.0.0.0" (or whatever the route is if you're not using default) on all the routers so we can see whats happening?

On second read over your earlier discussions regarding the BGP local-preference scheme.

Best common practice would dictate that routers in the same BGP AS should never have routes with different attributes, i.e. you should never set local preference in the middle of your network for a specific route so that some routers have the old route and some have the changed route.

As you are aware you should also have an iBGP full mesh or at least route-reflectors.

So what you should have is

RR1 iBGP to RR2, E001, E002 (last two are RRClients)

RR2 iBGP to RR1, E001, E002 (last two are RRClients)

Then all you have to do is on the network borders (which in this case are RR1 & RR2) just configure the local-preference on the route as it is learnt inbound (whether from eBGP or redistribution of a static route).

This way all routers will have "congruent routing information" which is what's expected in terms of best common practises.

I'm not too sure about the other background of what you're trying to do here but for example if you also intended some kind of load balancing scheme that could be done too but I don't think it's in the scope of what you were asking.

Anyway sorry for all the spam, let us know how you go after this one =)

Before I go into solution. let me ask you a design question and this will also have the solution within it for your problem.

Q. What happens if the links RC-E002 <-> RC-RR2 and RC-E001 <-> RC-RR1 both go down?

You are toast. Although RC-E002 has eBGP with the ISP, it won't pass the default route to RC-E001, because there is no iBGP session between RC-E002 and RC-E001. So RC-E001 literally will not be able to get to the Internet. You see what I mean here?
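The double-failure scenario can be checked with a small reachability sketch in Python. It assumes the four-router ring described in the original post and treats an iBGP session as "up" whenever the peer's loopback is still reachable over the remaining links; the link set and function names are illustrative only.

```python
from collections import deque

# The ring as described in the thread (each link is an unordered pair).
LINKS = {
    frozenset({"RC-E001", "RC-E002"}),
    frozenset({"RC-E002", "RC-RR2"}),
    frozenset({"RC-RR2", "RC-RR1"}),
    frozenset({"RC-RR1", "RC-E001"}),
}


def reachable(start, links):
    """Breadth-first search over the links that are still up."""
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        for link in links:
            if node in link:
                (peer,) = link - {node}
                if peer not in seen:
                    seen.add(peer)
                    queue.append(peer)
    return seen


def sessions_up(router, peers, links):
    """iBGP peers whose loopback is still reachable from router."""
    alive = reachable(router, links)
    return sorted(p for p in peers if p in alive)


failed = {
    frozenset({"RC-E002", "RC-RR2"}),
    frozenset({"RC-E001", "RC-RR1"}),
}
remaining = LINKS - failed

# Current design: RC-E001 peers only with the two RRs.
print(sessions_up("RC-E001", ["RC-RR1", "RC-RR2"], remaining))
# With an additional session to RC-E002.
print(sessions_up("RC-E001", ["RC-RR1", "RC-RR2", "RC-E002"], remaining))
```

With both links down, RC-E001 loses every BGP session in the current design even though RC-E002 and its upstream are still physically reachable.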

Now, the solution to your problem

Make RC-E002 as a RR with a LP=200 and rest of routers its client. Now, if the above scenario occurs then RC-E001 will still be able to get to the internet.

This will also fix your original problem. Since RC-E002 is the RR with an LP =200 and in a full mesh , it will send that to the other 3 routers(RC-RR1, RC-RR2, RC-E001). Now , RC-E001 will ignore the LP=150 from RC-RR1 and route via RC-E002.

Even if you are using MPLS here it will still be the same as it will just create an LSP between RC-E001 and RC-E002 between the loopbacks and follow that path.

In case RC-E002 dies, then RC-E001 will go to the internet via RC-RR1.

or if the link between RC-E002 and RC-E001 dies then RC-E001 will take the path from RC-RR1<->RC-RR2<->RC-E002 to the internet.

To your question - I agree, actually, but these links are completely separate: different locations, different media and different HW. But I see what you mean.

To your suggestion: I agree that RC-E001 somehow needs a BGP session with RC-E002, but as I said, I don't want to put RC-E002 on the same route-reflector level as our main route reflectors, RC-RR1 and RC-RR2. That is why I was thinking about a second level of RRs. But I'm not sure whether I can set up iBGP sessions to route reflectors on different levels.

You can setup different levels of RR's. You can have RC-E001 and RC-E002 in one level and have RC-RR2 and RC-RR1 in another level and then have the RR's talk to each other but to solve your original problem you would still need RC-E002 as a RR.

Please see this below which has a diagram of how to connect different RR's within the same AS.

Edit: Actually, looking at your topo, RC-E001 might also have to be made both an RR client and an RR if you want multiple clusters. Otherwise, if RC-E002 dies, RC-E001 won't be able to go out, as it will become isolated. An RR client in a cluster can only have RRs within the same cluster. But if it wants to talk to different clusters, then it needs to be an RR as well.

Honestly, having multiple clusters in your topology might not be beneficial and it will make it more messy in my opinion. Please stick to the solution i provided in my prev post and have a flat network

exactly, RC-E002 will be RR router but on the higher level of RR topology like this:

the question is: is it possible or is it allowed to set RC-E001 as route-reflector client of RR routers RC-E002 and RC-RR1 which are on different levels? I'm not quite sure how the BGP information will be exchange in this case.

The answer is NO. I explained that in my prev post in the Edit: message

Also in your diagram you have RC-E002 as a RR-client and RC-E001 also as a client to this client. You can't have a RR client to another RR client.

What you can do, though, is make RC-E002 an RR and then make RC-E001 its client and also an RR. Then you can have an RR <-> RR relationship between RC-E001 and RC-RR1, and this is allowed. Does that make sense?

I had edited my prev post which talks about this concept of communication between clusters.

I agree with you to some extent, but why do you want to have RC-E001 as a route reflector and RC-E002 as its client, and not the other way around?

I should clarify that RC-E002 is truly a backbone router and RC-E001 is, let's say, more or less a "stub router".

Kishore Chennupati wrote:

Also in your diagram you have RC-E002 as a RR-client and RC-E001 also as a client to this client. You can't have a RR client to another RR client.

I think Matt points to this post; actually I disagree with this statement as well. I haven't used a hierarchical RR topology in practice, but pretty much all MPLS topologies somehow refer to a hierarchical RR topology.

" What you can do though is to make RC-E002 an RR and then make RC-E001 its client and also and RR. Then you can have a RR <->RR relationship between RC-E001 and RC-RR1 and this is allowed. Does it make sense?"

Kishore,

that is exactly the point where I'm not quite sure I understand you correctly.

You mean I can set RC-E002 up as an RR while at the same time it stays a route-reflector client of RC-RR1 and RC-RR2? Do I understand you correctly?

Then I configure RC-E001 as a route-reflector client of RC-E002?

But how can I configure an "RR relationship" between RC-RR1 and RC-E001? Do you mean to configure a simple iBGP session between them? I don't think it's a good idea - we don't get a full mesh there if RC-E002 (the RR for RC-E001) fails.

Yes, you can add a level of hierarchy to route reflectors where one route reflector is the client of another, BUT just because you can bend the rules of iBGP split horizon like this doesn't mean you should, and I don't believe it is required, nor is it good practice.

It doesn't matter that there is no iBGP between E001 and E002, they should both be clients of the same RRs and the route will propagate via both RRs and back down. Unfortunately because these RRs also have their own route with better attributes the route will never get advertised, this is why setting the local preference was mentioned.
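Here is a rough Python sketch of why the route via E002 never reaches E001. It assumes each RR holds two paths for an Internet prefix (its own eBGP path with LP 150 and next-hop-self, and E002's path with LP 100) and that an RR only passes on its single best path. The dictionary layout and the reduced two-step comparison are assumptions made for illustration.

```python
def rr_advertises(paths):
    """A route reflector passes on only its own best path for a prefix.

    Simplified comparison: highest local preference, then eBGP over iBGP.
    """
    return max(paths, key=lambda p: (p["local_pref"], p["learned_via"] == "ebgp"))


def next_hops_seen_by_client(rr_tables):
    """Next hops a client learns: one best path from each RR."""
    return [rr_advertises(paths)["next_hop"] for paths in rr_tables.values()]


e002_path = {"next_hop": "RC-E002", "local_pref": 100, "learned_via": "ibgp-client"}

rr_tables = {
    "RC-RR1": [
        {"next_hop": "RC-RR1", "local_pref": 150, "learned_via": "ebgp"},
        e002_path,
    ],
    "RC-RR2": [
        {"next_hop": "RC-RR2", "local_pref": 150, "learned_via": "ebgp"},
        e002_path,
    ],
}

# E001 only ever hears about the RRs' own exits.
print(next_hops_seen_by_client(rr_tables))

# If E002's path carried a higher local preference, both RRs would reflect it.
e002_preferred = dict(e002_path, local_pref=200)
rr_tables_lp = {rr: [paths[0], e002_preferred] for rr, paths in rr_tables.items()}
print(next_hops_seen_by_client(rr_tables_lp))
```

The second case also changes the exit used by RR1 and RR2 themselves, which is the side effect to keep in mind before raising LP for E002's routes everywhere.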

I'm sure your issue is with route selection on the route reflectors, ultimately there are many ways to solve any problem like this.

@Kishore - I don't disagree with you at all, we are just talking about different topologies :)

@Konstantin - you don't need a BGP session between E001 & E002, the fact that we are adding sessions to solve the problem proves that either

- the BGP best route selection is not being controlled and the desired route is not advertised

- the BGP topology does not equate to a full mesh (or equivalent with RRs)

How about trying this: let's simplify the network for now. Make only one router an RR; it can be any of them and will need a session to each of the other three. Now have all routers inject routes into BGP and use local preference (at the ingress) to control the best route, then view this on the RR.

The label switching behaviour is just a side effect of the best BGP route being the one you don't desire.

Once you've got that working with one RR, you can make a second one for redundancy as described in my earlier post.

That's correct, I know, but the problem is not in the best path. The problem is how the next hop is reached from an RR client with an MPLS backbone versus a non-MPLS backbone.

In my topology I've broken one of the RR topology best practices - never run a BGP session from an RR client to an RR over another RR client. With a non-MPLS backbone it's not so obvious, but the MPLS backbone has shown me that this rule makes sense.

"The problem is how to reach the next-hop from RR-client in case of MPLS and in case of non-MPLS backbone."

What we are trying to say is that one RR client can't learn the next hop of the other RR client due to BGP best path selection on the RR, and that this needs to be controlled so that the route with the next hop of the desired RR client is advertised. Yes, the behaviour changes when you introduce MPLS, but actually this is BGP's fault, and it only works without MPLS because IP routing is decided hop by hop. Really the issue here is a failure in the BGP design; MPLS forwarding is just the victim protocol.

I think just about everyone on here is trying to give you more or less the same solution;

- I am saying control best path selection with LP

- Sergey is saying the route isn't advertised due to eBGP > iBGP (and hence is not best route)

- Kishore was asking you to move the RR to E002 which would just be another way to control the best path without setting LP

- Varma was talking about LP as well

How about this, prove us wrong. Post the "show ip bgp summary" and "show ip bgp " from each router and explain how the suboptimal routing is being caused by anything other than the BGP best path selection on the RR.

You haven't mentioned it but if you want the RR to be either of the current ones while allowing E001 to reach the Internet via E002 while maintaining that RR1 and RR2 still use their directly connected egress, you'll need to do something a little more complicated, either;

- change the node which acts as the RR to E002 so that the only path E001 learns in a stable topology is via E002

- change the node which acts as the RR to E001 so that E001 learns all paths and decides hopefully using lowest cost IGP metric or perhaps LP (it's dangerous to let BGP decide on its own if you have your own policy in mind)

Is this why you don't want to change BGP? So that you don't influence other routers egress?

"Yes, the behaviour changes when you introduce MPLS, but actually this is BGPs fault and only working without MPLS because IP routing is decided hop by hop. Really the issue here is a failure in the BGP design, MPLS forwarding is just the victim protocol."

I see your point and I agree 100%. MPLS is forwarding the traffic exactly the way BGP wants it, which in my case is unfortunately a little bit wrong.

"How about this, prove us wrong. Post the "show ip bgp summary" and "show ip bgp " from each router and explain how the suboptimal routing is being caused by anything other than the BGP best path selection on the RR."

"sh ip bgp summ" on all routers shows two iBGP sessions with both RRs and one with the upstream router.

On RC-E001 there are only the two iBGP sessions with both RRs.

"show ip bgp 8.8.8.8" on all routers shows that the best path is via the upstream router.

On RC-E001 it shows 2 paths - via RC-RR1 and via RC-RR2 - and as the weight is higher for the BGP session with RC-RR1, that one is chosen as the best path. But because of the IGP metrics, RC-RR1 is reached via RC-E002 and not directly via the backup connection.
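The last point, that the IGP resolves the BGP next hop RC-RR1 through RC-E002, can be illustrated with a small shortest-path calculation in Python. The link costs below are made-up assumptions (a low cost on the primary links and a high cost on the backup link); only the shape of the ring comes from the thread.

```python
import heapq

# Assumed OSPF-style link costs; the backup link is deliberately expensive.
COSTS = {
    ("RC-E001", "RC-E002"): 10,
    ("RC-E002", "RC-RR2"): 10,
    ("RC-RR2", "RC-RR1"): 10,
    ("RC-E001", "RC-RR1"): 1000,  # backup link
}


def neighbours(node):
    """Yield (peer, cost) for every link attached to node."""
    for (a, b), cost in COSTS.items():
        if a == node:
            yield b, cost
        elif b == node:
            yield a, cost


def shortest_path(src, dst):
    """Dijkstra's algorithm returning (total_cost, path) or None."""
    queue = [(0, src, [src])]
    visited = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == dst:
            return cost, path
        if node in visited:
            continue
        visited.add(node)
        for peer, link_cost in neighbours(node):
            if peer not in visited:
                heapq.heappush(queue, (cost + link_cost, peer, path + [peer]))
    return None


cost, path = shortest_path("RC-E001", "RC-RR1")
print(cost, " -> ".join(path))
```

Under these assumed costs the path to RC-RR1's loopback runs through RC-E002 and RC-RR2, matching the label-switched hops in the traceroute.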

You haven't mentioned it but if you want the RR to be either of the current ones while allowing E001 to reach the Internet via E002 while maintaining that RR1 and RR2 still use their directly connected egress, you'll need to do something a little more complicated, either;

- change the node which acts as the RR to E002 so that the only path E001 learns in a stable topology is via E002

- change the node which acts as the RR to E001 so that E001 learns all paths and decides hopefully using lowest cost IGP metric or perhaps LP (it's dangerous to let BGP decide on its own if you have your own policy in mind)

Is this why you don't want to change BGP? So that you don't influence other routers egress?

are our backbone routers with upstream eBGP and at the same time they are "P" routers for our MPLS network.

As a full mesh is not really possible in our case (too many BGP sessions), RC-RR1 and RC-RR2 were chosen as route reflectors for Internet routing (MPLS route reflectors are outside the scope of this discussion) because of their location and performance. All routers have a direct "physical" connection to both RRs. All routers should primarily use their own upstream link for external communication.

Some time ago we added RC-E001 to our network, but RC-E001 doesn't have an upstream; it's more or less a stub router, but it still needs the full BGP table. It has a direct physical connection only to RC-E002 (primary link) and to RC-RR1 (secondary link, because of price and bandwidth).

I can't simply just move one RR to RC-E002, it means to re-configure 10 routers.

I don't think it's a good idea to put a third RR in the network - it would unnecessarily increase the amount of routing information on all routers.

I can't configure RC-E001 as Route-reflector, because it's more like a "stub" router

You're right about the route-map next-hop setting, but then I need some kind of tracking.

I also came to the idea of iBGP between RC-E001 and RC-E002, but couldn't find any pros and cons of an iBGP session between route-reflector clients; it seems that not many people have tried this.

When you contrast a full mesh with an RR design, it is really about reducing the number of sessions. You can add more sessions; it shouldn't have a negative impact. The trap would be using route-reflector-client in too many places, but we're not adding it anywhere, so we're safe.