Adaptive Game-Theoretic Multi-Robot Coordination Based on Resource Spending Velocity
Dan ERUSALIMCHIK, Gal A. KAMINKA and Sarit KRAUS
Department of Computer Science, Bar-Ilan University, Israel
Selecting a Coordination Method (Algorithm)
Coordination improves a team's performance and increases its robustness. However, many different coordination methods exist, and no single method is always best: the effectiveness of a coordination method depends on the context, such as agent density and group size. We therefore need a technique for selecting coordination methods dynamically, switching methods as needed [Rybski et al. 1998, Rosenfeld et al. 2008].

Game Theory Analysis of LCT Tasks
There are common themes that run through all the tasks in which EI has been successful. We refer to these tasks as LCT tasks (Loose-coordination, Cooperative, Timed tasks):
➢ Loose coordination between the robots (i.e., only an occasional need for spatial coordination)
➢ A cooperative task (the robots seek to maximize group utility)
➢ The task is bounded in time.

LCT Task as Extensive-Form Game
[Figure: extensive-form game tree; at each conflict time $C_1, C_2, \ldots$ up to the deadline $T$, Robot 1 and then Robot 2 select a coordination method, and the time between consecutive conflicts satisfies $C_2 - C_1 = I^a_1 + I^p_1$.]
Coordination Algorithm Selection as Reinforcement Learning in One-stage MDPs
➢ Each agent has a set of actions (coordination methods).
➢ Each agent selects its action independently of the others.
➢ Participants receive the greatest reward for joint actions.
➢ There is no observation of the others' actions and no communication between agents.
➢ All agents are anonymous.
➢ Rewards are not known a priori (they have to be learned).
Conflicts Timeline
Two states: Coordination and Task-Execution.
[Figure: conflicts timeline for robot $i$; between Start and Finish Time, conflicts occur at times $C_{i,0}, \ldots, C_{i,j}, C_{i,j+1}, \ldots$, and each inter-conflict interval is divided into an active coordination interval $I^a_{i,j}$ and a passive task-execution interval $I^p_{i,j}$.]
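To make the bookkeeping behind the timeline concrete, here is a minimal sketch (our illustration; the class, method, and field names are assumptions, not part of the poster) of how a robot could record the intervals the timeline defines:

```python
import time

class ConflictTimeline:
    """Illustrative per-robot bookkeeping of conflicts.

    For each resolved conflict j it records the active (Coordination)
    interval I^a_{i,j} and the passive (Task-Execution) interval I^p_{i,j}.
    """

    def __init__(self):
        self.records = []            # one dict per resolved conflict
        self._conflict_start = None
        self._coordination_end = None

    def conflict_detected(self):
        # A new conflict C_{i,j} begins: the robot enters Coordination.
        self._conflict_start = time.time()

    def coordination_finished(self):
        # Coordination ends: the robot returns to Task-Execution.
        self._coordination_end = time.time()

    def next_conflict_or_finish(self):
        # Called at the next conflict C_{i,j+1} (or at Finish Time):
        # close the interval and store I^a_{i,j} and I^p_{i,j}.
        now = time.time()
        active = self._coordination_end - self._conflict_start    # I^a_{i,j}
        passive = now - self._coordination_end                    # I^p_{i,j}
        self.records.append({"I_a": active, "I_p": passive})
```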
The EI (Effectiveness Index) Reward Function

EI is computed by monitoring three quantities:
1. The frequency of coordinating. Monitor the time from one conflict to the next, and select the method that leads to the longest period without conflicts.
2. The time spent coordinating. Monitor the time needed to resolve each conflict, and select the quickest method.
3. The cost of coordination. Monitor resource spending during coordination, and select the cheapest method.

From Team Utility to Individual Reward

Team Level: a matrix game over the coordination methods $\alpha^1, \alpha^2$, whose entry for the joint choice $(i, j)$ is $K_{i,j} U_{i,j}$, where $K_{i,j}$ counts the conflicts (within total time $T$) in which these methods are selected and $U_{i,j}$ is the utility gained:
$$\begin{array}{c|cc} & \alpha^1 & \alpha^2 \\ \hline \alpha^1 & K_{1,1} U_{1,1} & K_{2,1} U_{2,1} \\ \alpha^2 & K_{1,2} U_{1,2} & K_{2,2} U_{2,2} \end{array}$$
$$U_{i,j} = \left[\,\mathit{gain}(I^p_i) + \mathit{gain}(I^p_j)\,\right] - \left[\,C^c_i + C^c_j\,\right]$$

Individual Level: each team-level entry splits into the two robots' individual payoffs $\bigl(K_{i,j}\,u_1(\alpha^i),\; K_{i,j}\,u_2(\alpha^j)\bigr)$, where
$$u_k(\alpha^k) = \mathit{gain}(I^p_k) - C^c_k$$

Calculatable, Individual Level: each robot's payoff is replaced by a quantity it can actually measure, the EI of the method it selected at conflict $j$; for example, the entry in which both robots select $\alpha^1$ becomes $\bigl(EI_{1,j}(\alpha^1),\; EI_{2,j}(\alpha^1)\bigr)$.
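To make the decomposition easier to follow, here is the chain for a single matrix entry written out in one line (our restatement of the matrices above; the arrows only indicate which quantity stands in for which at each level, with lower EI standing in for higher utility):

```latex
K_{1,1}\,U_{1,1}
  \;\longrightarrow\;
  \bigl(\, K_{1,1}\,u_1(\alpha^1),\; K_{1,1}\,u_2(\alpha^1) \,\bigr)
  \;\longrightarrow\;
  \bigl(\, EI_{1,j}(\alpha^1),\; EI_{2,j}(\alpha^1) \,\bigr)
```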
But RL does not converge! Or does it?
$$EI_{i,j} = \frac{I^{a}_{i,j} + \mathit{cost}_{i}}{I_{i,j}}$$
where $I_{i,j} = I^a_{i,j} + I^p_{i,j}$ is the total length of the interval following conflict $j$, and $\mathit{cost}_{i}$ is the resources robot $i$ spends on coordination during $I^a_{i,j}$.
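As a concrete reading of this formula, here is a minimal sketch (our illustration; the function name and arguments are assumptions, and the cost terms follow the reconstruction above):

```python
def effectiveness_index(active_interval, passive_interval, coord_resource_cost):
    """Sketch of the EI computation for one conflict of one robot.

    active_interval     -- I^a_{i,j}: time spent coordinating
    passive_interval    -- I^p_{i,j}: time spent executing the task until
                           the next conflict (or until the deadline T)
    coord_resource_cost -- cost_i: resources spent while coordinating

    Lower EI means cheaper coordination per unit time, so -EI can be used
    directly as the learning reward.
    """
    total_interval = active_interval + passive_interval   # I_{i,j}
    return (active_interval + coord_resource_cost) / total_interval
```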
Simple reinforcement learning, in anonymous matched games, causes the population to converge to a Nash equilibrium, even if it is mixed [Hopkins, 1999].
Stateless Q-Learning Algorithm
Conditions for convergence (satisfied in EI-based adaptation):
➢ The game matrix must be symmetric.
➢ The agents use fictitious play or stimulus-response learning; the latter essentially corresponds to our stateless Q-learning.
➢ The agents select their initial methods at random.
$$Q_t(a) = Q_{t-1}(a) + \alpha \left[\, R_t(a) - Q_{t-1}(a) \,\right]$$
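A minimal sketch of how this update could drive coordination-method selection (our illustration: the poster fixes only the update rule; the epsilon-greedy exploration, the learning-rate value, and the class interface below are assumptions):

```python
import random

class StatelessQLearner:
    """Stateless Q-learning over coordination methods (illustrative)."""

    def __init__(self, methods, alpha=0.1, epsilon=0.1):
        self.q = {m: 0.0 for m in methods}   # one Q-value per coordination method
        self.alpha = alpha                    # learning rate
        self.epsilon = epsilon                # exploration rate

    def select_method(self):
        # Epsilon-greedy selection among the coordination methods.
        if random.random() < self.epsilon:
            return random.choice(list(self.q))
        return max(self.q, key=self.q.get)

    def update(self, method, ei):
        # Reward is the negated Effectiveness Index (lower EI is better).
        reward = -ei
        self.q[method] += self.alpha * (reward - self.q[method])
```

For example, a forager could be created as StatelessQLearner(["noise", "aggression", "repel"]) and updated with the measured EI after every conflict.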
Experiments:
➢ TeamBots: simulated robots, foraging [picture: simulated foraging]
➢ AIBO robots: real robots, foraging [picture: real robots]
➢ VR-Forces: task-oriented decision [picture: VR-Forces]
[Plot: pucks retrieved (median pucks collected per minute) vs. group size; unlimited resources, 20-minute trials; dynamic selection approaches (EI, ACIM, RND) compared with the static methods noise, aggression, and repel.]
[Plot: pucks retrieved by the group vs. group size under unexpected extra resource spending; 20-minute trials, 500 resource units, with aggression spending an extra 0.5 units per step.]
We can reduce the extensive-form game tree into a matrix game, where $K_{i,j}$ is the number of conflicts occurring within the total time $T$ that result from the first robot selecting method $i$ and the second robot selecting method $j$, and $U_{i,j}$ is the utility gained from this choice, as defined for the team-level matrix above.
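Purely as an illustration of this bookkeeping (the numbers are invented for exposition, not experimental results): if, within $T$, the joint choice $(\alpha^1, \alpha^1)$ arises at $K_{1,1}=4$ conflicts and each such conflict yields $U_{1,1}=2.5$, the corresponding team-level matrix entry is

```latex
K_{1,1}\,U_{1,1} \;=\; 4 \times 2.5 \;=\; 10 .
```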
[Plot: Learned EI by Robot; learned EI values for Robots A, B, and C, annotated with the coordination method (noise or repel) each robot settled on.]
[Plot: Learning of the best task solution (VR-Forces); the number of times Agents A, B, C, and D selected Path 1 (short) vs. Path 2 (long), comparing EI-based adaptation with ACIM, together with the learned (negated) EI values.]
See our paper in the Adaptive & Learning Agents AAMAS'09 Workshop (and Dan Erusalimchik's M.Sc. Thesis).