Flink taskmanager on ECS cannot connect to jobmanager on EC2

Question

I have an EC2 instance in us-east-1b running the Flink jobmanager, which coordinates work across multiple taskmanagers via RPC, and the Flink history server. netstat shows that the jobmanager is listening on :::6123 for incoming taskmanager connections.

I have an Auto Scaling group (ASG) that launches an EC2 instance into the same AZ, subnet and security group as the jobmanager instance.

The security group allows all traffic on all ports from any source in the group to any destination in the group.

I'm using that ASG as a capacity provider for ECS tasks. I'm then trying to run up a task in ECS that runs the taskmanager and uses that ASG.

The taskmanager starts up, but won't connect to the jobmanager:

2021-09-28 13:52:08,651 INFO  org.apache.flink.runtime.taskexecutor.TaskExecutor           [] - Could not resolve ResourceManager address akka.tcp://flink@ip-xxx-xx-x-xxx.ec2.internal:6123/user/rpc/resourcemanager_*, retrying in 10000 ms: Could not connect to rpc endpoint under address akka.tcp://flink@ip-xxx-xx-x-xxx.ec2.internal:6123/user/rpc/resourcemanager_*.

I've ssh-d onto the instance run up by the ASG and confirmed that I can curl the jobmanager on ip-xxx-xx-x-xxx.ec2.internal:8081 - it works. So I know that the taskmanager instance can see the jobmanager instance.
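A successful curl on port 8081 only shows that the web port is reachable; it does not exercise the RPC port 6123. Below is a minimal sketch of a plain TCP reachability check that could be run from the taskmanager instance or container. The function name `can_connect` is hypothetical and used only for illustration. Note that a successful TCP connection only rules out network-level blocking; the RPC layer can still reject traffic for its own reasons.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened.

    Hypothetical helper for illustration only. It tests TCP reachability,
    not whether the Flink RPC endpoint accepts the connection.
    """
    try:
        # create_connection resolves the host name and opens the socket.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections and timeouts.
        return False
```

For example, calling `can_connect("ip-xxx-xx-x-xxx.ec2.internal", 6123)` with the real jobmanager hostname in place of the redacted one would show whether the RPC port is reachable at the TCP level, independently of port 8081.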

To summarise:

The taskmanager and jobmanager are in the same VPC, the same AZ, the same subnet and the same security group
The security group allows all inbound traffic from sources in the same security group
The security group allows all outbound traffic to any destination
The jobmanager is running on an EC2 instance manually created
The taskmanager is running on an EC2 instance created as part of an ASG by ECS. The taskmanager runs in a container on ECS
I can curl the jobmanager from the taskmanager node
The taskmanager and jobmanager communicate over RPC
The taskmanager can't resolve the jobmanager's ResourceManager address

Why won't my task connect? I've also tried the public IP (v4) and the private IP (v4).

Please edit your question to add a screenshot of the SG in / out rules, and also describe what the task manager / job manager are and where they run. Try adding your security group to its own inbound rule, which allows instances in the same group to communicate with each other. The key thing to understand is that an SG is a firewall around each network interface; it's not like a traditional subnet where everything can communicate by default. - Tim, Sep 28 '21 at 17:20
Add an outbound security rule to the same SG. Allowing 0.0.0.0/0 is different from allowing access to a specific security group. I'm not 100% sure this is the solution, but you haven't provided enough information for me to be sure. Also, when taking SG screenshots you should include the top part, which shows the SG ID. Your first inbound SG rule covers all traffic, so the others are redundant. - Tim, Sep 28 '21 at 20:22
I added an outbound rule to the same SG allowing all traffic. There is no change. I can ping the ECS instance from the EC2 instance and vice versa. I appreciate your help, and I've given you everything you've asked for so far. If you still need more information I'd be happy to add it. - ndtreviv, Sep 29 '21 at 14:35

score 1 · Accepted Answer · answered Sep 29 '21 at 15:10

Today I discovered why this wasn't working.

The jobmanager was configured with:

jobmanager.rpc.address: localhost

and so, whilst listening on the right RPC port, it was not accepting RPC traffic addressed to any other hostname.

When I changed it to match the address the taskmanager was using:

jobmanager.rpc.address: ip-xxx-xx-x-xxx.ec2.internal

then the task manager connected immediately.
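To catch this kind of misconfiguration early, something like the following sketch could scan the jobmanager's configuration text and flag a loopback `jobmanager.rpc.address`. The helper names `rpc_address` and `is_loopback` are hypothetical, and the parser is a deliberately simple `key: value` line reader, not a full YAML parser.

```python
# Names that only make sense for a single-machine setup (assumed list).
LOOPBACK_NAMES = {"localhost", "127.0.0.1", "::1"}


def rpc_address(conf_text):
    """Return the value of jobmanager.rpc.address, or None if it is absent.

    Hypothetical helper: handles flat "key: value" lines and "#" comments only.
    """
    for raw in conf_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line or ":" not in line:
            continue
        key, value = line.split(":", 1)  # split on the first colon only
        if key.strip() == "jobmanager.rpc.address":
            return value.strip()
    return None


def is_loopback(address):
    """Return True if the address is a loopback name that remote taskmanagers cannot use."""
    return address is not None and address.lower() in LOOPBACK_NAMES
```

With the original setting, `is_loopback(rpc_address("jobmanager.rpc.address: localhost"))` is true, which is the case that stopped remote taskmanagers from connecting here.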
