## Game trees under a partial order part 3: Alpha Beta pruning

In this last post of the series I will discuss alpha beta pruning. The context is evaluation functions on game trees that take values in a distributive lattice. A prime example is the linear lattice $\mathscr{L}(X, \leq)$ over the partially ordered set $(X, \leq)$, but this post does not depend on specific features of that lattice. The result is a description of alpha beta pruning for distributive lattices that slightly improves the version given by Ginsberg et al in their 2001 paper “Alpha-Beta pruning under partial orders”.

Alpha Beta pruning is a way to compute the value of an evaluation function at the root node of a game tree while visiting as few child nodes as possible. See the previous post for a more extensive introduction to evaluation functions. Here I will only briefly recall the recursive definition of an evaluation function $f$ on a game tree that takes values in a lattice $L$. I will assume that $L$ has a minimal and a maximal element, denoted $\mathbf{0}$ and $\mathbf{1}$. The normal forms $\textrm{CNF}(X, \leq)$ and $\textrm{DNF}(X, \leq)$ of the linear lattice are examples of this.

1. The value of $f$ for a leaf node is given by some heuristic game evaluation function (e.g. chess piece values).
2. At an inner min node (where the minimizing player is to make a move) the value of $f$ is defined recursively as the greatest lower bound in $L$ of all values of $f$ on direct child nodes.
3. At an inner max node the value of $f$ is defined as the least upper bound of all values of $f$ on direct child nodes.

This definition leads to the following algorithm for the computation of $f$. This algorithm is also the basis of further considerations on ways to prune parts of the game tree that are irrelevant for the computation of $f$ on the root node.

```
function f(node)
    local value
    if node is a leaf node then
        value = heuristic position value of node
    elseif node is a min node then
        value = 1
        for child in all children of node do
            value = value ∧ f(child)
        end
    else
        value = 0
        for child in all children of node do
            value = value ∨ f(child)
        end
    end
    return value
end
```

The structure of this naive algorithm is simple. To compute the value at a min node start with the maximal value $\mathbf{1}$ then iterate over all child nodes and adjust the node’s value by taking the greatest lower bound with the value of a child node. At a max node the procedure is similar but starts off with the minimal value $\mathbf{0}$ and repeatedly takes the least upper bound. The computation recurses depth first left to right over all child nodes, where “left to right” is according to the enumeration order of child nodes. Move ordering does not play a role in this naive algorithm but it can play a big role in alpha beta pruning in terms of the number of pruned nodes. I will not address move ordering in this post. During execution of the algorithm the state of the computation is schematically depicted below. As in the example in the previous post I will assume that the top node is a min node.
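To make the naive algorithm concrete, here is a runnable Python sketch. It uses the distributive lattice of subsets of a small set (with $\wedge$ = intersection and $\vee$ = union) as a stand-in value lattice, and a tuple-based tree encoding invented just for this example:

```python
from functools import reduce

# Value lattice: subsets of a small universe, with ∧ = intersection and
# ∨ = union. This is a stand-in for any lattice with 0 and 1.
UNIVERSE = frozenset(range(4))
BOTTOM, TOP = frozenset(), UNIVERSE  # the lattice elements 0 and 1

# Hypothetical tree encoding for this example: a leaf is ('leaf', value),
# inner nodes are ('min', children) or ('max', children).
def evaluate(node):
    kind, payload = node
    if kind == 'leaf':
        return payload
    values = [evaluate(child) for child in payload]
    if kind == 'min':
        return reduce(frozenset.__and__, values, TOP)  # greatest lower bound
    return reduce(frozenset.__or__, values, BOTTOM)    # least upper bound

leaf = lambda *xs: ('leaf', frozenset(xs))
tree = ('min', [('max', [leaf(0, 1), leaf(1, 2)]),
                ('max', [leaf(0, 3), leaf(2)])])
print(sorted(evaluate(tree)))  # → [0, 2]; the root is a min node, as in the text
```

The same structure works verbatim for any other lattice by swapping out the two operators and the two constants.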

Intermediate values in the depth first left to right computation of an evaluation function

Here $x_1, x_2, \dotsc$ and $y_1, y_2, \dotsc$ are the intermediate values that are computed for the inner min and max nodes respectively. The $\star$ symbol indicates all subtrees located to the left of the current path of computation in the tree. These are the subtrees for which a value has already (recursively) been computed. As the computation progresses each intermediate value $x_j$ descends from $\mathbf{1}$ and each $y_j$ ascends from $\mathbf{0}$. If we were to cut the computation short at the point depicted above, the algorithm would backtrack up the tree and yield the final outcome

$x_1 \wedge (y_1 \vee (x_2 \wedge (y_2 \vee \dotsb \vee (x_n \wedge y_n))))$

at the root node. The essential point of alpha beta pruning is that it identifies when this intermediate value remains constant even when $y_n$ gets larger in the remainder of the computation: in that case any further children of the node marked $y_n$ in the diagram do not contribute to the final result and can be pruned (skipped). In the picture above the search path ends with a max node, but that is irrelevant for alpha beta pruning, as we will see.

A lattice $L$ has a canonical partial order where $x \leq y$ is equivalent to $x \wedge y = x$ (or alternatively $x \vee y = y$). The lattice operators $\wedge$ and $\vee$ indeed result in the greatest lower bound and least upper bound with respect to this ordering. The following lattice property is the key point in alpha beta pruning.

Lemma: When $x, y, z \in L$ and $y \leq z$ then $x \wedge y \leq x \wedge z$ and $x \vee y \leq x \vee z$.
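Since this lemma does all the work in what follows, it is worth spot-checking it exhaustively in a small distributive lattice. The sketch below uses the subsets of a three-element set (∧ = intersection, ∨ = union) as a stand-in:

```python
from itertools import combinations

# Stand-in distributive lattice: subsets of {0, 1, 2} with ∧ = intersection
# and ∨ = union; the canonical order x ≤ y (i.e. x ∧ y = x) is subset inclusion.
universe = (0, 1, 2)
subsets = [frozenset(c) for r in range(4) for c in combinations(universe, r)]

for x in subsets:
    for y in subsets:
        for z in subsets:
            if y <= z:                     # y ≤ z
                assert (x & y) <= (x & z)  # then x ∧ y ≤ x ∧ z
                assert (x | y) <= (x | z)  # and x ∨ y ≤ x ∨ z
print("lemma verified on", len(subsets) ** 3, "triples")
```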

Since $y \leq z$ the greatest lower bound $x \wedge y$ is a lower bound for both $x$ and $z$ and is therefore less than or equal to the greatest lower bound $x \wedge z$. The other inequality is derived similarly.

For sequences $x_1, x_2, \dotsc$ and $y_1, y_2, \dotsc$ in $L$ define sequences $\alpha$ and $\beta$ recursively by

1. $\alpha_1 = x_1 \wedge y_1$ and $\alpha_{j+1}$ is obtained from $\alpha_j$ by the substitution $y_j \leftarrow y_j \vee (x_{j+1} \wedge y_{j+1})$.
2. $\beta_1 = x_1$ and $\beta_{j+1}$ is obtained from $\beta_j$ by the substitution $x_j \leftarrow x_j \wedge (y_j \vee x_{j+1})$.

So the sequence $\alpha$ starts with

$\alpha_1 = x_1 \wedge y_1$
$\alpha_2 = x_1 \wedge (y_1 \vee (x_2 \wedge y_2))$
$\alpha_3 = x_1 \wedge (y_1 \vee (x_2 \wedge (y_2 \vee (x_3 \wedge y_3))))$
$\vdots$

and the sequence $\beta$ with

$\beta_1 = x_1$
$\beta_2 = x_1 \wedge (y_1 \vee x_2)$
$\beta_3 = x_1 \wedge (y_1 \vee (x_2 \wedge (y_2 \vee x_3)))$
$\vdots$

The following inequalities are a direct consequence of our lemma above. The sequences $\alpha$ and $\beta$ satisfy

$\alpha_j \leq \alpha_{j+1} \leq \beta_{j+1} \leq \beta_j$

for all indices $j$. In particular $\alpha$ is an ascending sequence, $\beta$ is a descending sequence, each $\alpha_j$ is a lower bound for $\beta$ and each $\beta_j$ is an upper bound for $\alpha$. If $\alpha_j = \beta_j$ or $\alpha_j = \beta_{j+1}$ for some index $j$ then both sequences are constant from that point onwards. The values in the sequences $\alpha$ and $\beta$ are precisely those obtained from aborting the computation of the evaluation function for the root node at max node $j$ (sequence $\alpha$) or at min node $j$ (sequence $\beta$): compare the expressions above with the intermediate outcome in the diagram. Hence further nodes can be pruned from the computation as soon as both sequences yield equal values.
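These inequalities can also be spot-checked numerically. The sketch below evaluates the nested formulas for $\alpha_j$ and $\beta_j$ directly, again in the subset lattice as a stand-in, on random sequences:

```python
import random

random.seed(1)
universe = range(5)

def rand_elt():
    # a random element of the subset lattice
    return frozenset(e for e in universe if random.random() < 0.5)

def alpha(xs, ys, j):
    # α_j = x_1 ∧ (y_1 ∨ (x_2 ∧ (y_2 ∨ … (x_j ∧ y_j))))
    t = xs[j - 1] & ys[j - 1]
    for i in range(j - 2, -1, -1):
        t = xs[i] & (ys[i] | t)
    return t

def beta(xs, ys, j):
    # β_j = x_1 ∧ (y_1 ∨ (x_2 ∧ (y_2 ∨ … x_j)))
    t = xs[j - 1]
    for i in range(j - 2, -1, -1):
        t = xs[i] & (ys[i] | t)
    return t

n = 6
for trial in range(200):
    xs = [rand_elt() for _ in range(n)]
    ys = [rand_elt() for _ in range(n)]
    for j in range(1, n):
        # α_j ≤ α_{j+1} ≤ β_{j+1} ≤ β_j (⊆ is the canonical lattice order)
        assert alpha(xs, ys, j) <= alpha(xs, ys, j + 1) \
               <= beta(xs, ys, j + 1) <= beta(xs, ys, j)
print("inequalities hold in all trials")
```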

Alpha Beta pruning (strong version): At the first index $j$ where either $\alpha_j = \beta_j$ or $\alpha_j = \beta_{j+1}$ further subnodes of max node $j$ or min node $j+1$ can be pruned.

Note that this formulation of alpha beta pruning resembles “classical” alpha beta pruning for totally ordered sets in that we have an ascending sequence of lower bounds and a descending sequence of upper bounds, and nodes can be pruned as soon as these values coincide. There is also an unfortunate difference between the two. In classical alpha beta pruning the bounds are easy to compute progressively as a simple maximum or minimum value. In contrast, the bounds in the pruning algorithm above are not so simple to compute: since they are computed deepest node first, they require constant backtracking. Moreover, computing $\wedge$ and $\vee$ can be much more expensive than simply taking the maximum or minimum of two numbers.

Also note that the distributive property of the lattice $L$ is not used at all in this version of alpha beta pruning! This does not contradict the results of Ginsberg et al since they formulate alpha beta pruning (in particular “deep pruning”) differently. I will briefly come back to this point at the end of this post.

There is a weaker version of alpha beta pruning based on alternative sequences $\alpha'$ and $\beta'$ that are easier to compute. The trade-off is that it may take longer before the ascending lower bounds and descending upper bounds allow a cut off, resulting in less aggressive pruning. The weaker bounds follow from the following identities, where the distributive property of $L$ is now essential. Let $y_0 = \mathbf{0}$ and define two new sequences $\alpha'$ and $\beta'$ by

$\alpha'_j = y_0 \vee \dotsc \vee y_{j-1}$ and
$\beta'_j = x_1 \wedge \dotsc \wedge x_j$.

So $\alpha'_j$ is the least upper bound of the values of all max nodes above min node $j$ and $\beta'_j$ is the greatest lower bound of the values of all min nodes above max node $j$. If we take $\alpha_0 = \mathbf{0}$ and $\beta_0 = \mathbf{1}$ then for all indices $j \geq 1$ the following equalities hold:

$\alpha_j = \alpha_{j-1} \vee (\beta'_j \wedge y_j)$ and
$\beta_j = \beta_{j-1} \wedge (\alpha'_j \vee x_j)$.

These can be proved by induction and the distributive property of the lattice $L$. Taking $y_j = x_j$ in the first equation and $x_j = y_{j-1}$ in the second results in

$\beta_j = \alpha_{j-1} \vee \beta'_j$ and
$\alpha_{j-1} = \beta_{j-1} \wedge \alpha'_j$.

Comparing these equalities it follows that $x_j \leq \alpha'_j$ implies $\beta_j = \alpha_{j-1}$ and $\beta'_j \leq y_j$ implies $\alpha_j = \beta_j$. This leads to the following version of alpha beta pruning.

Alpha Beta pruning: If the value $x_j$ of min node $j$ becomes less than or equal to $\alpha'_j$, or the value $y_j$ of max node $j$ becomes greater than or equal to $\beta'_j$, then further child nodes can be pruned.

This is a slightly different formulation of alpha beta pruning than Ginsberg et al use in their paper. What they call (shallow or deep) pruning comes down to either $x_j \leq y_k$ for some $k < j$ at min node $j$ or $x_k \leq y_j$ for some $k \leq j$ at max node $j$. These are stronger conditions than those we found above and will therefore prune fewer nodes in general.
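The identities and derived equalities above can be spot-checked numerically before trusting them in an algorithm. Again the subset lattice serves as a stand-in for an arbitrary distributive lattice:

```python
import random

random.seed(2)
universe = range(5)
BOTTOM, TOP = frozenset(), frozenset(universe)  # the elements 0 and 1

def rand_elt():
    return frozenset(e for e in universe if random.random() < 0.5)

def alpha(xs, ys, j):  # α_j as a nested formula, with α_0 = 0
    if j == 0:
        return BOTTOM
    t = xs[j - 1] & ys[j - 1]
    for i in range(j - 2, -1, -1):
        t = xs[i] & (ys[i] | t)
    return t

def beta(xs, ys, j):  # β_j as a nested formula, with β_0 = 1
    if j == 0:
        return TOP
    t = xs[j - 1]
    for i in range(j - 2, -1, -1):
        t = xs[i] & (ys[i] | t)
    return t

def alpha_weak(ys, j):  # α'_j = y_0 ∨ … ∨ y_{j-1}, with y_0 = 0
    t = BOTTOM
    for y in ys[: j - 1]:
        t = t | y
    return t

def beta_weak(xs, j):  # β'_j = x_1 ∧ … ∧ x_j
    t = TOP
    for x in xs[:j]:
        t = t & x
    return t

n = 5
for trial in range(300):
    xs = [rand_elt() for _ in range(n)]
    ys = [rand_elt() for _ in range(n)]
    for j in range(1, n + 1):
        # α_j = α_{j-1} ∨ (β'_j ∧ y_j) and β_j = β_{j-1} ∧ (α'_j ∨ x_j)
        assert alpha(xs, ys, j) == alpha(xs, ys, j - 1) | (beta_weak(xs, j) & ys[j - 1])
        assert beta(xs, ys, j) == beta(xs, ys, j - 1) & (alpha_weak(ys, j) | xs[j - 1])
print("identities verified")
```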

Here is an algorithmic formulation of alpha beta pruning. The value of the root node is computed by `f(root, 0, 1)`.

```
function f(node, α, β)
    local value
    if node is a leaf node then
        value = heuristic position value of node
    elseif node is a min node then
        value = 1
        for child in all children of node do
            value = value ∧ f(child, α, β ∧ value)
            if value ≤ α then break end
        end
    else
        value = 0
        for child in all children of node do
            value = value ∨ f(child, α ∨ value, β)
            if value ≥ β then break end
        end
    end
    return value
end
```
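Here is a runnable Python version of both the naive computation and the pruned one, over the subset lattice as a stand-in, with a randomized check that pruning never changes the root value (the tree encoding is invented for the example):

```python
import random
from functools import reduce

# Value lattice: subsets of a small universe (∧ = ∩, ∨ = ∪), a stand-in
# for any distributive lattice with 0 and 1.
UNIVERSE = frozenset(range(4))
BOTTOM, TOP = frozenset(), UNIVERSE

def naive(node):
    kind, payload = node
    if kind == 'leaf':
        return payload
    op = frozenset.__and__ if kind == 'min' else frozenset.__or__
    start = TOP if kind == 'min' else BOTTOM
    return reduce(op, (naive(c) for c in payload), start)

def f(node, a, b):  # alpha beta pruning as in the pseudocode above
    kind, payload = node
    if kind == 'leaf':
        return payload
    if kind == 'min':
        value = TOP
        for child in payload:
            value &= f(child, a, b & value)
            if value <= a:
                break
        return value
    value = BOTTOM
    for child in payload:
        value |= f(child, a | value, b)
        if value >= b:
            break
    return value

def rand_tree(depth, kind):
    if depth == 0:
        return ('leaf', frozenset(e for e in UNIVERSE if random.random() < 0.5))
    other = 'max' if kind == 'min' else 'min'
    return (kind, [rand_tree(depth - 1, other) for _ in range(3)])

random.seed(0)
for _ in range(100):
    t = rand_tree(3, 'min')
    assert f(t, BOTTOM, TOP) == naive(t)
print("pruned and naive root values agree")
```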

To sum up the results of this post: I described two versions of alpha beta pruning for the computation, at the root node, of an evaluation function with values in a distributive lattice. The first, strong, version has the disadvantage that it is computationally more expensive than the second. Both versions are stronger (they potentially prune more nodes) than the method described in the paper by Ginsberg et al.

## Game trees under a partial order part 2: Evaluation functions

The first part of this series introduced the maximizing and minimizing orders on finite antichains in any partially ordered set $(X, \leq)$. In this second part I will show how these orders and their properties are used to define evaluation functions on game trees. The previous post was mainly based on the paper “Searching game trees under a partial order” by P. Dasgupta et al. In this second post I will show how it is related to another paper, “Alpha-Beta pruning under partial orders” by M. Ginsberg et al from 2001. This second paper takes an interestingly different approach. Instead of starting with a partial order from the beginning, it investigates what structure an evaluation function on a game tree must have to satisfy some reasonable constraints. The answer they find is that such an evaluation function must be defined on a lattice, that is, a set with operators “greatest lower bound” $\wedge$ and “least upper bound” $\vee$. An evaluation function is then defined recursively, starting at the leaves. The minimizing player combines the values of subnodes with $\wedge$ while the maximizing player uses $\vee$. Moreover, Ginsberg et al show that such an evaluation function allows alpha beta pruning (which I will define in detail in the next post) exactly if the lattice is distributive. While the paper by Ginsberg et al references that of Dasgupta et al, the relation between the two is not made very explicit. This post will give a very precise and (I think) satisfying relation between the two papers, ultimately improving the results of both: the work of Dasgupta et al is formalized in terms of lattice theory and can now be applied to any partially ordered set, not just $\mathbb{R}^n$, and the result of Ginsberg et al on alpha beta pruning is sharpened in the next post, potentially leading to more aggressive pruning in game tree searches.

Consider a game tree in which a min and max player make moves in turn. Suppose that leaf nodes in this tree get a value in the partially ordered space $X$ assigned to them. Is there a way, given the values at the leaves, to define an evaluation function on the entire tree? Let’s first assume that both players in fact use an (unspecified) linear extension $\prec$ of the partial order $\leq$ to select their optimal move. (Dasgupta et al call such a linear extension a strategy. See the intermezzo below for why it is fair to assume that both players use the same strategy.) Here is an example of how this linear extension can be used to define an evaluation function on a tree of positions. The symbols “$\bullet$” at the leaf nodes indicate some (not necessarily equal) values in $X$. The greatest lower bound and least upper bound $\wedge$ and $\vee$ are taken with respect to the linear extension $\prec$.

An evaluation function based on a linear extension (or strategy)

In this example the minimizing player is to move and the tree represents possible move developments that are considered. The value at inner nodes is defined recursively based on the linear extension: The minimizing player takes the minimum (the greatest lower bound) of the subnodes while the maximizing player takes the maximum (the least upper bound). This leads to a value for the top node in $X$ and the minimizing player will select a move that leads to this minimum.

Intermezzo. Assuming that the min player in this example uses an unspecified linear extension (or “strategy” as Dasgupta calls them) to select moves is central in our considerations. But why is it fair to assume that the max player uses the same strategy? The reason is simple. Suppose the max player uses a different strategy or in fact any system at all (even random decisions) to select moves. In that case the altered evaluation function would assign to each node a value that is less than or equal to the value obtained from equal strategies (when compared with the min player’s linear extension). This follows from the recursive definition of the evaluation function. Since the min player’s goal is to reach a minimal value, the worst case scenario is to assume that the opponent uses the same strategy. End of intermezzo.

Any evaluation function defined as above still depends on the choice of linear extension, so different linear extensions give different evaluation functions. What we are looking for instead is a way to define a single evaluation function based only on the given partial order on $X$. There is a nice trick to turn the example above into exactly such a function, but it requires altering the target space. To define the new target space start with $\mathscr{F}(X, \leq)$, the space of all functions that map linear extensions of $\leq$ to $X$. The evaluation defined above at any node can be regarded as an example of such a function by treating the linear extension $\prec$ in its definition not as fixed but as a variable. Let me make this more precise based on the following observations:

1. There is a canonical embedding $\iota: X \hookrightarrow \mathscr{F}(X, \leq)$ as constant functions: $\iota_x(\prec) = x$ for all linear extensions $\prec$.
2. If $f, g \in \mathscr{F}(X, \leq)$ then define an element $f \wedge g \in \mathscr{F}(X, \leq)$ by $(f \wedge g)(\prec) = f(\prec) \wedge g(\prec)$ where $\wedge$ in the right hand side refers to the minimum with respect to $\prec$.
3. If $f, g \in \mathscr{F}(X, \leq)$ then define an element $f \vee g \in \mathscr{F}(X, \leq)$ by $(f \vee g)(\prec) = f(\prec) \vee g(\prec)$ where $\vee$ in the right hand side refers to the maximum with respect to $\prec$.
4. Since any totally ordered set is a distributive lattice with respect to its minimum and maximum operators $\wedge$ and $\vee$ it follows that the definitions above turn $\mathscr{F}(X, \leq)$ into a distributive lattice and therefore also into a partially ordered set.
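These observations can be made concrete in a few lines of Python. The sketch below (a toy construction for illustration) enumerates the linear extensions of a small poset, represents an element of $\mathscr{F}(X, \leq)$ as one value per extension, and implements $\iota$, $\wedge$ and $\vee$ pointwise:

```python
from itertools import permutations

# Toy poset on four points: a ≤ c, b ≤ c, a ≤ d and b ≤ d (plus reflexivity).
X = ['a', 'b', 'c', 'd']
le = {('a', 'c'), ('b', 'c'), ('a', 'd'), ('b', 'd')} | {(x, x) for x in X}

# All linear extensions: total orders (as permutations) consistent with le.
def extensions():
    for perm in permutations(X):
        pos = {x: i for i, x in enumerate(perm)}
        if all(pos[x] <= pos[y] for (x, y) in le):
            yield perm

EXTS = list(extensions())

def iota(x):
    # the canonical embedding of x as a constant function
    return tuple(x for _ in EXTS)

def meet(f, g):
    # pointwise minimum with respect to each linear extension
    return tuple(a if ext.index(a) <= ext.index(b) else b
                 for ext, a, b in zip(EXTS, f, g))

def join(f, g):
    # pointwise maximum with respect to each linear extension
    return tuple(a if ext.index(a) >= ext.index(b) else b
                 for ext, a, b in zip(EXTS, f, g))

m = meet(iota('a'), iota('b'))
print(len(EXTS), "linear extensions; ι_a ∧ ι_b =", m)
```

Note that $\iota_a \wedge \iota_b$ comes out as a non-constant function here: with $a$ and $b$ incomparable, it picks whichever of the two is smaller in each extension, so it is not of the form $\iota_x$ for any $x \in X$.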

The entire space $\mathscr{F}(X, \leq)$ is too big for our purpose, therefore I define $\mathscr{L}(X, \leq)$, the linear lattice over $(X, \leq)$ as the smallest sublattice of $\mathscr{F}(X, \leq)$ that contains $X$. Then $\mathscr{L}(X, \leq)$ is still a distributive lattice. The linear lattice $\mathscr{L}(X, \leq)$ will serve as the target space for evaluation functions on game trees under the partial order $(X, \leq)$.

Remark. The set $X$ embeds into $\mathscr{L}(X, \leq)$ as a partially ordered set but not always as a lattice. For example if $x, y \in X$ have a greatest lower bound $x \wedge y \in X$ then $\iota_{x \wedge y}$ is a lower bound for $\iota_x$ and $\iota_y$ but not necessarily the greatest lower bound. End of remark.

Example. If $(X, \leq)$ is a totally ordered space then $\mathscr{L}(X, \leq) = (X, \leq)$. End of example.

An evaluation function on a game tree is now defined recursively, as in the picture above, as follows.

1. Assign values in $X \subseteq \mathscr{L}(X, \leq)$ to the leaf nodes based on some heuristics.
2. Define the value of an inner min node as the greatest lower bound in $\mathscr{L}(X, \leq)$ of the values of all its direct successors.
3. Define the value of an inner max node as the least upper bound in $\mathscr{L}(X, \leq)$ of the values of all its direct successors.

Note that this construction is completely in line with the results of Ginsberg et al. The fact that $\mathscr{L}(X, \leq)$ is a distributive lattice even implies that alpha beta pruning should be applicable in the recursive computation of the evaluation function. I will address that part in the next post.

The linear lattice may seem rather abstract but the fact that it is distributive and generated by $X$ means that each element can be put in both a disjunctive and conjunctive normal form built from elements of $X$ only. These normal forms can be described directly in terms of minimizing and maximizing orders as introduced in the first post. This will identify the linear lattice with what Dasgupta et al call “dominance algebra”. Define two partially ordered spaces, suggestively called $\textrm{CNF}(X, \leq)$ and $\textrm{DNF}(X, \leq)$, by applying the star order extension twice as in the picture below:

Two step construction of DNF (top) and CNF (bottom)

So $\textrm{CNF}(X, \leq)$ is the set of finite antichains in $(X^{\ast}, \leqslant)$ ordered by their maximizing order and $\textrm{DNF}(X, \leq)$ the set of finite antichains in $(X^{\ast}, \eqslantless)$ ordered by their minimizing order. If $\prec$ is a linear extension of $\leq$ I denote its minimum and maximum operators by $\curlywedge$ and $\curlyvee$ respectively to distinguish them from the lattice operators in CNF and DNF. The following statements identify the normal forms of $\mathscr{L}(X, \leq)$.

CNF: For any partially ordered space $(X, \leq)$ the space $\textrm{CNF}(X, \leq)$ is a lattice with minimal and maximal elements $\mathbf{0} = \{\varnothing\}$ and $\mathbf{1} = \varnothing$ respectively. The sublattice $\textrm{CNF}(X, \leq) \setminus \{\mathbf{0}, \mathbf{1}\}$ is isomorphic to $\mathscr{L}(X, \leq)$ via the rule $x(\prec) = \curlywedge \{ \curlyvee s \mid s \in x\}$.

DNF: For any partially ordered space $(X, \leq)$ the space $\textrm{DNF}(X, \leq)$ is a lattice with minimal and maximal elements $\mathbf{0} = \varnothing$ and $\mathbf{1} = \{\varnothing\}$ respectively. The sublattice $\textrm{DNF}(X, \leq) \setminus \{\mathbf{0}, \mathbf{1}\}$ is isomorphic to $\mathscr{L}(X, \leq)$ via the rule $x(\prec) = \curlyvee \{ \curlywedge s \mid s \in x\}$.

I will only show the CNF case. In the previous post we saw that any maximizing order always has a greatest lower bound operator $\wedge$ and it also has a least upper bound operator $\vee$ if its base space has one. Since in particular $(X^{\ast}, \leqslant)$ always has a least upper bound operator, this shows that $(X^{\ast \ast}, \leqslant)$ is a lattice. The empty set (which is a finite antichain) is maximal for any maximizing order and minimal for any minimizing order. Therefore the elements $\mathbf{0} = \{\varnothing\}$ and $\mathbf{1} = \varnothing$ are minimal and maximal elements for this lattice respectively. A simple verification shows that the given evaluation rule at least defines a surjective homomorphism onto $\mathscr{L}(X, \leq)$ so it remains to prove that this homomorphism is also injective (meaning that the conjunctive normal form is unique). Suppose $A, B \in \textrm{CNF}(X, \leq)$ are such that $A(\prec) = B(\prec)$ for all linear extensions $\prec$. I will show that this implies $A \subseteq B$ (which suffices since by symmetry then also $B \subseteq A$ and $A = B$).

Let $a \in A$. For any $x \in a$ there exists a linear extension $\prec$ such that

1. $x$ is maximal in the lower set ${\downarrow} a$ and
2. if $y \in X$ and $y \prec x$ then $y \in {\downarrow} a$.

This implies that $A(\prec) = x$ since, being an antichain, no element of $A$ other than $a$ itself can be entirely contained in ${\downarrow} a$ and therefore must have a maximum greater than $x$. Then by assumption also $B(\prec) = x$. In particular all $b \in B$ with $b \leqslant a$ must contain $x$ and at least one such element must exist in $B$. Since this holds for any $x \in a$ it follows that $a \subseteq b$ for all $b \in B$ with $b \leqslant a$. This implies that $a = b$ and hence $a \in B$.

Summing up the results of this post: The linear lattice $\mathscr{L}(X, \leq)$ is the natural target space for evaluation functions on game trees based on heuristic evaluation of positions with values in the partially ordered set $(X, \leq)$. Both lattices $\textrm{CNF}(X, \leq)$ and $\textrm{DNF}(X, \leq)$ (minus their maximal and minimal elements) are isomorphic to the linear lattice $\mathscr{L}(X, \leq)$ and explicitly represent elements in their unique conjunctive and disjunctive normal form respectively. In particular the CNF and DNF lattices are themselves distributive and isomorphic. The isomorphisms both ways correspond to what in the paper by Dasgupta et al is called “MAX to MIN” and “MIN to MAX”.

In the next and final post in this series I will discuss alpha beta pruning for game trees under a partial order.

## Game trees under a partial order part 1: Partial orders

I’ve been interested in the subject of alpha-beta pruning of game trees for quite some time now. It has always struck me as unsatisfactory, however, that it relies on a total order of game positions (in the form of an evaluation function). It would be much more natural to evaluate game positions based on several distinct features and select best moves on this richer set of information. This means essentially that the total order of game positions is replaced by a partial order. There are a number of articles that explore this idea. The theory is, somewhat surprisingly maybe, quite a bit more involved than it is for total orders. The problem is that with a partial order there is often not a single best move to choose from, and the bookkeeping to handle all these choices is rather intricate. This first post in a series of three is based on a foundational paper, “Searching game trees under a partial order” by P. Dasgupta et al, published in 1996. My motivation for these posts is firstly that it is a fascinating subject, but also that I found the paper rather difficult to read and understand. I think that a more mathematical style would help the overall presentation of the subject, and that is what I will try to accomplish in this series of blogs. This first post concentrates on the concept of partial orders underpinning the game tree evaluations later on. That said, this post is also interesting in itself as a result in the theory of partial orders, even without the context of its application in game trees.

Let $(X, \leq)$ be a partially ordered set. It suffices to consider only finite sets $X$ for our application but the presented theory applies to any set and we will not assume finiteness. I will write $a < b$ as shorthand for $a \leq b$ and $a \neq b$. Our goal in this post is to introduce a sensible way to compare finite (possibly empty) antichains in $X$. An antichain is a subset of $X$ in which any two distinct elements are unordered. Recall that a linear extension $\prec$ of $\leq$ is a total order on $X$ that is consistent with $\leq$ in the sense that  $x < y$ implies $x \prec y$. Throughout this series of posts the symbol $\prec$ always indicates a linear extension and $\curlywedge$ and $\curlyvee$ denote the minimum (greatest lower bound) and maximum (least upper bound) for this linear extension respectively.

A subset $S \subseteq X$ is called a lower set if for all $x \in S$ and $y \in X$ the inequality $y \leq x$ implies that $y \in S$. Any subset $S \subseteq X$ can be extended to a smallest lower set containing $S$ by the definition

${\downarrow}S = \{ y \in X \mid y \leq x$ for some $x \in S\}$.

Note that if $S, T \subseteq X$ are antichains such that ${\downarrow} S = {\downarrow} T$ then $S = T$. Similarly one can define upper sets and the smallest upper set ${\uparrow} S$ containing $S$.
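In Python, for a finite poset stored as a set of related pairs, the lower and upper set operators are one-liners (the poset here is a toy example of my own):

```python
# Toy poset on four points: a ≤ c, b ≤ c and a ≤ d (plus reflexivity).
X = {'a', 'b', 'c', 'd'}
le = {('a', 'c'), ('b', 'c'), ('a', 'd')} | {(x, x) for x in X}

def down(S):
    # ↓S: the smallest lower set containing S
    return {y for y in X if any((y, x) in le for x in S)}

def up(S):
    # ↑S: the smallest upper set containing S
    return {y for y in X if any((x, y) in le for x in S)}

print(sorted(down({'c'})))  # → ['a', 'b', 'c']
print(sorted(up({'a'})))    # → ['a', 'c', 'd']
```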

Lemma: Let $S_1 \subseteq \dotsc \subseteq S_n = X$ be nested lower sets. Then there is a linear extension $\prec$ of $\leq$ such that for all indices $j < k$ and all elements $x \in S_j$ and $y \in S_k \setminus S_j$ the strict inequality $x \prec y$ holds.

If we take $n = 1$ then this lemma simply states that there exists a linear extension. A more important consequence is the following.

Corollary: If $S \subseteq X$ is an antichain then there exists a linear extension $\prec$ such that for any $x \in S$ and $y \in X$ the inequality $y \prec x$ implies that $y \in {\downarrow} S$. Moreover, for any $x \in S$ this extension can be chosen such that $x$ is maximal in ${\downarrow} S$.

The first part follows by taking $n=2$ and the lower set $S_1 = {\downarrow} S$ in the lemma. For the second part take $n = 3$ and the two lower sets $S_1 = {\downarrow}(S \setminus \{x\})$ and $S_2 = {\downarrow} S$. Since $x \in S_2 \setminus S_1$ it is greater than all other elements in $S$ for the resulting linear extension.

Linear extensions are in a sense the “hidden variables” in the theory of game trees under a partial order. If game positions are only partially ordered then searching for moves will most likely not result in a single best move but in a finite set of best moves. Nevertheless a player must still choose a single move from such a set. I will make the fundamental assumption that both players use an otherwise unspecified linear extension $\prec$ to make their choice. (Dasgupta et al call the linear extension they use a “strategy”. Moreover they consider situations where both players use a different strategy. This turns out to be unnecessary as we will see in the next post.) Then the “minimizing player” MIN always selects a minimum move according to $\prec$ while the “maximizing player” MAX always selects a maximum move. Clearly different linear extensions lead to different gameplay for both players but nevertheless this idea will lead to a definition of gameplay that only depends on the underlying partial order and not on a particular choice of strategy. In what follows the central idea is therefore to single out properties that hold for all possible linear extensions and are “universal” in that sense.

Based on this idea we will construct minimizing and maximizing orders on finite antichains of $X$. Suppose that player MIN is in a situation where only two moves are possible. The first move will lead to a finite set $A \subseteq X$ of best moves for player MAX to choose from while the second move leads to a finite set $B$. Then player MAX will select the maximum $\curlyvee A$ when MIN plays the first move and $\curlyvee B$ when MIN plays the second. For player MIN this means that the first move is preferred over the second precisely when $\curlyvee A \preceq \curlyvee B$. If this relation holds for all possible linear extensions then the first move for MIN is universally better than the second. We can take this as the definition for a minimizing order $\leqslant$ on finite antichains in $X$ and express it as $A \leqslant B$. This definition is potentially rather awkward, quantifying over all linear extensions, but it can be drastically simplified as I will now show.

If $A \subseteq {\downarrow} B$ then for each $x \in A$ there is a $y \in B$ such that $x \leq y$. In particular, for any linear extension the inequality $x \preceq y$ also holds and therefore $\curlyvee A \preceq \curlyvee B$. So $A \leqslant B$ holds in this case. Now assume instead that there is an $x \in A$ such that $x \not \in {\downarrow} B$. Then according to our lemma there is a linear extension such that $y \prec x$ for all $y \in B$ and so $\curlyvee B \prec \curlyvee A$ for this particular linear extension. In this case the relation $A \leqslant B$ does not hold. This justifies the definition below. An alternative formulation that works for all finite subsets is included as a small bonus. See the intermezzo for some more background.

Minimizing order: For finite antichains $A, B \subseteq X$ define the relation $A \leqslant B$ as $A \subseteq {\downarrow} B$ or alternatively for all finite subsets as $(A \setminus B) \subseteq {\downarrow}(B \setminus A)$.

Similar considerations (with the roles of the MIN and MAX player reversed) lead to the following definition of a maximizing order, where ${\uparrow} A$ now indicates the smallest upper set containing $A$:

Maximizing order: For finite antichains $A, B \subseteq X$ define the relation $A \eqslantless B$ as $B \subseteq {\uparrow} A$ or alternatively for all finite subsets as $(B \setminus A) \subseteq {\uparrow}(A \setminus B)$.

The minimizing order expresses that the minimizing player MIN prefers the set of game positions $A$ over $B$ while the maximizing order expresses that player MAX prefers $B$ over $A$. It is not difficult to check that the maximizing and minimizing orders are in fact partial orders on finite antichains in $X$. Note that when $A, B \subseteq X$ are finite antichains and $A \subseteq B$ then $B \eqslantless A$ and $A \leqslant B$. In particular for any finite antichain $A \subseteq X$ the inequalities $\varnothing \leqslant A$ and $A \eqslantless \varnothing$ hold. This also shows that these are different partial orders in general.
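Both orders, and the claims just made about them, are easy to check exhaustively on a small poset. This sketch (a toy example) enumerates all finite antichains and verifies that the minimizing and maximizing relations are partial orders with the stated behaviour of the empty antichain:

```python
from itertools import combinations

# Toy poset: a ≤ c, b ≤ c and a ≤ d (plus reflexivity).
X = {'a', 'b', 'c', 'd'}
le = {('a', 'c'), ('b', 'c'), ('a', 'd')} | {(x, x) for x in X}

def down(S):
    return {y for y in X if any((y, x) in le for x in S)}

def up(S):
    return {y for y in X if any((x, y) in le for x in S)}

def is_antichain(S):
    return all((x, y) not in le for x in S for y in S if x != y)

antichains = [frozenset(c) for r in range(len(X) + 1)
              for c in combinations(sorted(X), r) if is_antichain(c)]

def min_le(A, B):  # the minimizing order: A ⩽ B iff A ⊆ ↓B
    return A <= down(B)

def max_le(A, B):  # the maximizing order: A ⪕ B iff B ⊆ ↑A
    return B <= up(A)

for order in (min_le, max_le):
    for A in antichains:
        assert order(A, A)                  # reflexive
        for B in antichains:
            if order(A, B) and order(B, A):
                assert A == B               # antisymmetric
            for C in antichains:
                if order(A, B) and order(B, C):
                    assert order(A, C)      # transitive

empty = frozenset()
assert all(min_le(empty, A) and max_le(A, empty) for A in antichains)
print(len(antichains), "antichains; both relations are partial orders")
```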

Intermezzo. The alternative formulation of the maximizing and minimizing orders results in partial orders on all finite subsets. These partial orders are then almost but not quite identical to what Dasgupta et al call “procedure compare.” Where Dasgupta et al prefer a non-empty set over an empty one in that procedure, the maximizing and minimizing orders always prefer the empty set over a non-empty one as already stated before. The transitivity property requires some attention and can be proved by the nice picture below, where it is assumed that $A \eqslantless B$ and $B \eqslantless C$.

Descending chains in the transitivity proof

The nodes in the diagram correspond to subsets of $A \cup B \cup C$ where the letters in a node select all elements that are exactly in these subsets and in no others. So “A” means $A \setminus (B \cup C)$ and “A, B” means $(A \cap B) \setminus C$ and so on. The top two nodes together form the set $C \setminus A$ and the bottom two nodes $A \setminus C$. Directed edges leaving a node indicate that for each $a$ in the source set there exists $b$ in one of the target sets such that $b < a$. The edges together encode the information obtained from the relations $A \eqslantless B$ and $B \eqslantless C$. Since descending chains must be finite each chain starting in $C \setminus A$ (top row) must eventually reach $A \setminus C$ (bottom row) and transitivity follows. End of intermezzo.

Until now I only considered the conditions under which a single antichain $A$ is preferred over an antichain $B$ for player MIN, which is expressed by $A \leqslant B$. However, player MIN will often have to choose from more than two possible moves so we also have to consider conditions under which a set $\{A_1, \dotsc, A_n\}$ of more than one antichain is preferred over $B$. That would mean that for any linear extension $\prec$ there is some index $j$ such that $\curlyvee A_j \preceq \curlyvee B$, where the index might depend on the choice of linear extension. It turns out that in this case there must be an index $j$ such that already $A_j \leqslant B$. In other words, player MIN prefers a set of antichains over $B$ if and only if some member of that set is already preferred over $B$.

This again follows from the lemma. Let $\prec$ be a linear extension such that $x \prec y$ for all $x \in B$ and $y \not \in {\downarrow} B$. Suppose that $A_j \not \leqslant B$ for all indices $j$. Then for each index $j$ there is an $x_j \in A_j$ such that $x_j \not \in {\downarrow} B$. Then $\curlyvee B \prec x_j \preceq \curlyvee A_j$ for all indices $j$ and therefore the set $\{A_1, \dotsc, A_n\}$ is not preferred over $B$.

Let $X^{\ast}$ denote the set of finite (possibly empty) antichains in the partially ordered set $X$.  Then both $(X^{\ast}, \eqslantless)$ and $(X^{\ast}, \leqslant)$ are partially ordered sets with the maximizing and minimizing order respectively. Later on the lattice structure of such partial orders will play an important role. A partially ordered set $(X, \leq)$ is a lattice when all non-empty finite subsets $S \subseteq X$ have a greatest lower bound $\wedge S$ and a least upper bound $\vee S$. In that case it is easy to see that $\wedge S$ and $\vee S$ are uniquely defined. The following theorems give sufficient conditions for $(X^{\ast}, \eqslantless)$ and $(X^{\ast}, \leqslant)$ to be lattices.
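As a concrete illustration, the two orders are easy to put into code. The sketch below is my own (pure Python, with the partial order on $X$ passed in as a predicate) and assumes the formulations recalled above: $A \eqslantless B$ when every element of $B$ dominates some element of $A$, and $A \leqslant B$ when every element of $A$ is dominated by some element of $B$.

```python
# The maximizing and minimizing orders on finite antichains, with the
# partial order on X supplied as a predicate `leq`. Assumed here:
#   A maximizing-below B  iff  every b in B dominates some a in A,
#   A minimizing-below B  iff  every a in A is dominated by some b in B.

def maximizing_leq(A, B, leq):
    """Decide whether A is below B in the maximizing order."""
    return all(any(leq(a, b) for a in A) for b in B)

def minimizing_leq(A, B, leq):
    """Decide whether A is below B in the minimizing order."""
    return all(any(leq(a, b) for b in B) for a in A)

# Example: X = pairs of integers ordered coordinatewise.
leq = lambda x, y: x[0] <= y[0] and x[1] <= y[1]
A = {(0, 1), (1, 0)}
B = {(0, 2), (2, 0), (1, 1)}

assert minimizing_leq(A, B, leq) and maximizing_leq(A, B, leq)
# The empty antichain is the best outcome for MIN and the worst for MAX.
assert minimizing_leq(set(), A, leq) and maximizing_leq(A, set(), leq)
```

Note how the roles of $A$ and $B$ swap between the two predicates; that swap is exactly the asymmetry observed above for the empty antichain.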

Maximizing lattice: In the partially ordered set $(X^{\ast}, \eqslantless)$ every non-empty finite subset $S \subseteq X^{\ast}$ has a greatest lower bound $\wedge S \in X^{\ast}$. If every non-empty finite subset $T \subseteq X$ has a least upper bound $\vee T \in X$ then the same holds for $X^{\ast}$ and so $(X^{\ast}, \eqslantless)$ is a lattice.

Minimizing lattice: In the partially ordered set $(X^{\ast}, \leqslant)$ every non-empty finite subset $S \subseteq X^{\ast}$ has a least upper bound $\vee S \in X^{\ast}$. If every non-empty finite subset $T \subseteq X$ has a greatest lower bound $\wedge T \in X$ then the same holds for $X^{\ast}$ and so $(X^{\ast}, \leqslant)$ is a lattice.

I will prove only the maximizing lattice property, so the partial order on $X^{\ast}$ is the maximizing order in what follows. Let $S_1, \dotsc, S_n \in X^{\ast}$ be finite antichains. The union $S = S_1 \cup \dotsc \cup S_n$ is a finite subset of $X$ but not necessarily an antichain.  Let $\min(S) \subseteq S$ be the set of minimal elements in $S$, that is, those elements which have no smaller elements in $S$. Then $\min(S)$ is an antichain and because $S$ is finite $\min(S) \eqslantless S_j$ for all indices $j$. So $\min(S)$ is a lower bound for $\{S_1, \dotsc, S_n\} \subseteq X^{\ast}$. If $S' \in X^{\ast}$ is any lower bound then for each $x \in S$ there is a $y \in S'$ such that $y \leq x$. This means that $S' \eqslantless \min(S)$ since $\min(S) \subseteq S$ and so $\min(S)$ is the greatest lower bound.

Now suppose every non-empty finite subset $T \subseteq X$ has a least upper bound $\vee T \in X$. If $U' \in X^{\ast}$ is an upper bound for $\{S_1, \dotsc, S_n\} \subseteq X^{\ast}$ then for all $y \in U'$ there exist $x_1 \in S_1, \dotsc, x_n \in S_n$ such that $x_j \leq y$ for all indices $j$ and so $\vee\{x_1, \dotsc, x_n\} \leq y$. Let $U \subseteq X$ be the set of minimal elements

$U = \min \left\{ \vee \{x_1, \dotsc, x_n\} \mid (x_1, \dotsc, x_n) \in S_1 \times \dotsc \times S_n \right\}$.

Then $U$ is an antichain and $U \eqslantless U'$ by the above. Now $U$ is itself also an upper bound for $\{S_1, \dotsc, S_n\}$ and so it is the least upper bound. This shows that under the given condition $(X^{\ast}, \eqslantless)$ is a lattice.
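The two constructions in this proof translate directly into code. In the sketch below (my own illustration) the partial order and the join on $X$ are supplied as functions, matching the hypothesis of the theorem: `glb` takes the minimal elements of the union, and `lub` takes the minimal elements of the pointwise joins.

```python
from functools import reduce
from itertools import product

def minimal_elements(S, leq):
    """The antichain min(S) of minimal elements of a finite subset S of X."""
    return {x for x in S if not any(y != x and leq(y, x) for y in S)}

def glb(antichains, leq):
    """Greatest lower bound in the maximizing order: min(S_1 u ... u S_n)."""
    return minimal_elements(set().union(*antichains), leq)

def lub(antichains, leq, join):
    """Least upper bound in the maximizing order: the minimal elements of
    the joins over all choices (x_1, ..., x_n) in S_1 x ... x S_n."""
    joins = {reduce(join, choice) for choice in product(*antichains)}
    return minimal_elements(joins, leq)

# Example: X = pairs of integers ordered coordinatewise,
# where the join is the coordinatewise maximum.
leq = lambda x, y: x[0] <= y[0] and x[1] <= y[1]
join = lambda x, y: (max(x[0], y[0]), max(x[1], y[1]))

assert glb([{(0, 2), (2, 0)}, {(1, 1)}], leq) == {(0, 2), (2, 0), (1, 1)}
assert lub([{(0, 2), (2, 0)}, {(1, 1)}], leq, join) == {(1, 2), (2, 1)}
```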

To sum up the results of this post: For any partially ordered set $(X, \leq)$ we constructed a minimizing and a maximizing partial order on the set $X^{\ast}$ of finite antichains in $X$. These indicate when a set of outcomes is universally preferred (under all linear extensions) over another set by a minimizing and maximizing player respectively. For these “star extended” partially ordered sets we defined the lattice operations of taking the greatest lower bound and least upper bound (when these exist). In the next post in this series we will start to explore actual game trees under a partial order, starting with the definition of evaluation functions.


## Monoids and data structures

Thinking about ways to formally specify the behavior of software, it occurred to me that there is a very natural way to express data structures and their operations in the language of monoids, without referring to an actual implementation of that data structure. A monoid is a set with an associative binary operator (written as multiplication) and an identity element. Monoids in this post will be given by a generating set and a number of relations between the generators. Multiplication corresponds to adding an element to a data structure or merging two data structures. Functions on data structures are defined recursively using the fact that each member can be expressed as some product of generators.

As a first example I will define a stack datatype (last in, first out) with the operations “push”, “pop” and “top”. A stack of type T is simply the free monoid generated by all elements of T without any additional relations:

$\text{stack}\langle T \rangle = \langle T \rangle$

The empty stack is represented by the unit $1 \in \text{stack} \langle T \rangle$. Pushing an element $x \in T$ onto a stack $s \in \text{stack} \langle T \rangle$ is defined by

$\text{push}{:{}} \text{stack} \langle T \rangle \times T \rightarrow \text{stack} \langle T \rangle\\ \text{push}(s, x) = s \cdot x$

Popping the top element from a non-empty stack is defined recursively as

$\text{pop}{:{}} \text{stack} \langle T \rangle \rightarrow \text{stack} \langle T \rangle\\ \text{pop}(s) = \begin{cases} 1 & \text{if } s \in T \\ u \cdot \text{pop}(v) & \text{if } s = u \cdot v \text{ for some }u, v \neq 1\end{cases}$.

The top element on a non-empty stack is given by the following recursive definition

$\text{top}{:{}} \text{stack} \langle T \rangle \rightarrow T\\ \text{top}(s) = \begin{cases} s & \text{if } s \in T \\ \text{top}(v) & \text{if }s = u \cdot v \text{ for some } u, v \neq 1\end{cases}$
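To make the recursion concrete, here is a pure-Python sketch of my own that represents an element of the free monoid as a tuple of generators (the unit $1$ being the empty tuple) and transcribes the recursive definitions literally, always splitting a product $s = u \cdot v$ with $u$ a single generator.

```python
# An element of stack<T> is a word in the free monoid over T,
# represented as a tuple of generators; the unit 1 is the empty tuple.

empty = ()                      # the unit element 1

def push(s, x):
    """push(s, x) = s * x"""
    return s + (x,)

def pop(s):
    """pop(s) = 1 if s in T, else u * pop(v) for a split s = u * v."""
    assert s != empty, "pop is only defined on non-empty stacks"
    if len(s) == 1:             # s is a single generator, i.e. s in T
        return empty
    u, v = s[:1], s[1:]         # one way of writing s = u * v
    return u + pop(v)

def top(s):
    """top(s) = s if s in T, else top(v) for a split s = u * v."""
    assert s != empty, "top is only defined on non-empty stacks"
    if len(s) == 1:
        return s[0]
    return top(s[1:])

s = push(push(push(empty, 1), 2), 3)
assert top(s) == 3 and pop(s) == (1, 2)
```

Because the monoid is free, any other choice of the split $s = u \cdot v$ would give the same answers, which is exactly why the recursive definitions are well defined.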

A queue (first in first out) has a very similar definition. These first two examples are based on free monoids which makes it somewhat easier to define their structure and operations. As a slightly more interesting example I will now define a set (as an unordered sequence in which each element appears at most once). The monoid for a set of elements in $T$ is generated by $T$ and some additional relations:

$\text{set} \langle T \rangle = \langle T \mid x \cdot x = x \text{ for all }x \text{ and } x \cdot y = y \cdot x \text { for all } x \neq y \rangle$

These additional relations allow a “compact” representation of sets in the monoid (for example $x \cdot x \cdot x = x$) but more importantly they capture the requirements of a set. The relation $x \cdot x = x$ expresses that all elements in a set have multiplicity one (I will come back to this in a moment) and $x \cdot y = y \cdot x$ that a set is unordered. Here are the definitions of some sensible set operations:

Insert an element:

$\text{insert}{:{}} \text{set} \langle T \rangle \times T \rightarrow \text{set} \langle T \rangle\\ \text{insert}(s, x) = s \cdot x$

Set union:

$\text{union}{:{}} \text{set} \langle T \rangle \times \text{set} \langle T \rangle \rightarrow \text{set} \langle T \rangle\\ \text{union}(s, t) = s \cdot t$

Test if an element is present:

$\text{contains}{:{}} \text{set} \langle T \rangle \times T \rightarrow \{ \textbf{true}, \textbf{false} \}\\ \text{contains}(s, x) = \begin{cases} \textbf{false} & \text{if } s = 1\\ \textbf{true} & \text{if }s = x\\ \text{contains}(u, x) \vee \text{contains}(v, x) & \text{if } s = u \cdot v \text{ for some }u, v \neq 1\end{cases}$

Remove an element:

$\text{remove}{:{}} \text{set} \langle T \rangle \times T \rightarrow \text{set} \langle T \rangle\\ \text{remove}(s, x) = \begin{cases} 1 & \text{if } s = x\\ s & \text{if }s = 1 \text{ or }s \in T \text{ and } s \neq x\\ \text{remove}(u, x) \cdot \text{remove}(v, x) & \text{if } s = u \cdot v \text{ for some }u, v \neq 1\end{cases}$

All the operations are well defined given the relations on the monoid $\text{set} \langle T \rangle$. On the other hand, the following multiplicity function is not well defined even though it might look acceptable at first glance:

$\text{mult}{:{}} \text{set} \langle T \rangle \times T \rightarrow \mathbb{N}\\ \text{mult}(s, x) = \begin{cases} 1 & \text{if }s = x\\ 0 & \text{if }s = 1 \text{ or } s \in T \text{ and } s \neq x\\ \text{mult}(u, x) + \text{mult}(v, x) & \text{if }s = u \cdot v \text{ for some }u, v \neq 1\end{cases}$

since it would result in $1 = \text{mult}(x, x) = \text{mult}(x \cdot x, x) = 1 + 1$, a contradiction. The problem here is that $+$ is symmetric but not idempotent and so does not respect the relation $x \cdot x = x$ that holds in $\text{set} \langle T \rangle$. This underlines that the relations encode the requirements for a data structure by restricting the possible operations that can be defined on it. Note that dropping the relation $x \cdot x = x$ would result in a multiset, on which the operation $\text{mult}$ is well defined.
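Under the two relations every element of $\text{set}\langle T \rangle$ has the set of generators occurring in it as a normal form, so a Python `frozenset` gives one faithful representation of this monoid. The sketch below is my own illustration, not a prescribed implementation.

```python
# Elements of set<T> in normal form: the relations x*x = x and
# x*y = y*x mean a congruence class is determined by the set of
# generators occurring in a word, so a frozenset represents it
# faithfully. The unit 1 is the empty frozenset.

empty = frozenset()

def insert(s, x):            # insert(s, x) = s * x
    return s | {x}

def union(s, t):             # union(s, t) = s * t
    return s | t

def contains(s, x):          # agrees with the recursive definition
    return x in s

def remove(s, x):
    return s - {x}

s = insert(insert(empty, 1), 2)
assert insert(s, 1) == s            # x * x = x: inserting twice is a no-op
assert contains(s, 2) and not contains(s, 3)
assert remove(s, 1) == frozenset({2})
assert union(s, insert(empty, 3)) == frozenset({1, 2, 3})
```

A multiplicity count cannot be written against this representation for the same reason $\text{mult}$ is not well defined: the normal form has already collapsed $x \cdot x$ to $x$.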

As a final example I will define an associative array, without spending too many words. Here are the definitions, where $\pi{:{}} S \times T \rightarrow S$ is projection on the first coordinate:

$\text{map}\langle S, T \rangle = \langle S \times T \mid x \cdot y = y \text { if } \pi(x) = \pi(y) \text{ and } x \cdot y = y \cdot x \text{ if } \pi(x) \neq \pi(y) \rangle$

Insert (or update) a pair $x \in S \times T$:

$\text{insert}{:{}} \text{map} \langle S, T \rangle \times (S \times T) \rightarrow \text{map} \langle S, T \rangle\\ \text{insert}(m, x) = m \cdot x$

Test if a key $x \in S$ is present:

$\text{has}{:{}} \text{map} \langle S, T \rangle \times S \rightarrow \{ \textbf{true}, \textbf{false} \}\\ \text{has}(m, x) = \begin{cases}\textbf{true} & \text {if }m \in S \times T \text{ and } \pi(m) = x\\ \textbf{false} & \text{if } m=1 \text{ or } m \in S \times T \text{ and } \pi(m) \neq x\\ \text{has}(u, x) \vee \text{has}(v, x) & \text{if }m = u \cdot v \text{ for some }u, v \neq 1\end{cases}$

Lookup the value that is associated with key $x \in S$ if it is present:

$\text{index}{:{}} \text{map} \langle S, T \rangle \times S \rightarrow T\\ \text{index}(m, x) = \begin{cases}y & \text{if }m = (x, y) \in S \times T\\ \text{index}(v, x) & \text{if }m = u \cdot v \text{ for some }u, v \neq 1 \text{ and } \text{has}(v, x) = \textbf{true}\\ \text{index}(u, x) & \text{if }m = u \cdot v \text{ for some }u, v \neq 1 \text{ and } \text{has}(v, x) = \textbf{false}\end{cases}$

Remove the key $x \in S$:

$\text{del}{:{}} \text{map} \langle S, T \rangle \times S \rightarrow \text{map} \langle S, T \rangle\\ \text{del}(m, x) = \begin{cases}1 & \text{if } m \in S \times T \text{ and } \pi(m) = x\\ m & \text{if }m = 1 \text{ or }m \in S \times T \text{ and } \pi(m) \neq x\\ \text{del}(u, x) \cdot \text{del}(v, x) & \text{if }m = u \cdot v \text{ for some }u, v \neq 1\end{cases}$
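As with sets, the relations of $\text{map}\langle S, T \rangle$ admit a simple normal form: in any product of pairs the last pair per key wins and the order of distinct keys is irrelevant, so a Python `dict` is a faithful representation. The sketch below is my own; `del` is renamed `delete` because it is a Python keyword.

```python
# Elements of map<S, T> in normal form: the relations make a product
# of pairs normalize to "last write per key wins", so a dict
# represents a congruence class faithfully; the unit 1 is {}.

def insert(m, x):            # insert(m, (key, value)) = m * (key, value)
    key, value = x
    return {**m, key: value}

def has(m, key):
    return key in m

def index(m, key):           # only defined when has(m, key) is true
    return m[key]

def delete(m, key):          # "del" is a Python keyword, hence "delete"
    return {k: v for k, v in m.items() if k != key}

m = insert(insert({}, ('a', 1)), ('b', 2))
assert has(m, 'a') and index(m, 'b') == 2
assert index(insert(m, ('a', 7)), 'a') == 7   # x * y = y when keys agree
assert not has(delete(m, 'a'), 'a')
```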

## Functions with positive real part

Let $f$ be a holomorphic function on the unit disc $\Delta$ such that $f(0)=1$ and $\Re f(z) > 0$ for all $z \in \Delta$. If the Taylor expansion of $f$ is

$f(z) = 1 + 2a_1z+2a_2z^2+2a_3z^3+\dotsc$

then a classical result states that $|a_n| \leq 1$ for all integers $n \geq 1$. Moreover, such functions are in one-to-one correspondence with probability measures on the unit circle $\Gamma$ as follows. Let $\mu$ be a probability measure on $\Gamma$ and let $\hat{\mu}_k \in \mathbb{C}$ for $k \in \mathbb{Z}$ be its Fourier coefficients

$\hat{\mu}_k = \displaystyle \int_{\Gamma} z^{-k} d\mu(z)$.

Then $\hat{\mu}_0 = 1$, $|\hat{\mu}_k| \leq 1$ and $\hat{\mu}_{-k}$ is the complex conjugate of $\hat{\mu}_k$. If $\mu$ has a density function $\delta$ with respect to the Lebesgue probability measure that is holomorphic on some neighborhood of $\Gamma$ then $\delta$ is real-valued on $\Gamma$ and has the Laurent expansion

$\delta(z) = \displaystyle \sum_{k \in \mathbb{Z}} \hat{\mu}_k z^k$.

Now define a function $f_{\mu}$ on a neighborhood of the closed disc $\overline{\Delta}$ as follows:

$f_{\mu}(z) = 1 + 2\displaystyle \sum_{k = 1}^{\infty} \hat{\mu}_k z^k$.

Then, using the fact that $\overline{z} = z^{-1}$ for all $z \in \Gamma$, we have $\Re f_{\mu}(z) = \delta(z) \geq 0$ for all $z \in \Gamma$ and therefore $\Re f_{\mu}(z) > 0$ for all $z \in \Delta$. It turns out that the correspondence $\mu \leftrightarrow f_{\mu}$ between probability measures and functions with positive real part also holds for $\mu$ that do not have such a nice density function. (In general the density is a hyperfunction on $\Gamma$ related to $f_{\mu}$.) The correspondence that I sketched here is usually presented as a convexity result in functional analysis, but I think that the direct link to density functions is quite instructive.
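As a worked example (my own, with the density chosen for convenience), take $\delta(e^{i\theta}) = 1 + \cos\theta$, a non-negative density whose only non-zero Fourier coefficients are $\hat{\mu}_0 = 1$ and $\hat{\mu}_{\pm 1} = 1/2$, so that $f_{\mu}(z) = 1 + z$. The sketch below recovers this numerically.

```python
import cmath
import math

N = 1024  # number of quadrature points on the circle

def fourier_coefficient(density, k):
    """Approximate mu_k = (1/2pi) * integral over Gamma of z^{-k} d mu(z)
    for d mu = density * (normalized Lebesgue measure)."""
    total = 0.0
    for j in range(N):
        theta = 2 * math.pi * j / N
        total += cmath.exp(-1j * k * theta) * density(theta)
    return total / N

delta = lambda theta: 1 + math.cos(theta)

mu = [fourier_coefficient(delta, k) for k in range(1, 20)]

def f_mu(z):
    """f_mu(z) = 1 + 2 * sum_{k >= 1} mu_k z^k."""
    return 1 + 2 * sum(c * z**k for k, c in enumerate(mu, start=1))

# On the circle the real part of f_mu recovers the density ...
assert abs(f_mu(cmath.exp(0.7j)).real - delta(0.7)) < 1e-6
# ... and inside the disc the real part stays positive.
assert f_mu(0.3 + 0.4j).real > 0
```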

There is, however, also a different way to parameterize functions with a positive real part, but it is much less known (I’d be interested in any reference, in fact). Let $g$ be any holomorphic function on a neighborhood of the closed disc $\overline{\Delta}$ with Taylor expansion

$g(z) = \displaystyle \sum_{k=0}^{\infty} a_kz^k$.

Let $\overline{g}$ be the holomorphic function such that $\overline{g}( \overline{z} ) = \overline{g(z)}$, so the coefficients of $\overline{g}$ are the complex conjugates of those of $g$. Then for all $z \in \Gamma$

$g(z) \overline{g}(z^{-1}) = g(z) \overline{g(z)} = |g(z)|^2 \geq 0$.

Now the function $g(z) \overline{g}(z^{-1})$ has the Laurent expansion

$(\displaystyle \sum_{k=0}^{\infty} a_kz^k)(\sum_{k=0}^{\infty} \overline{a}_kz^{-k}) = \sum_{k \in \mathbb{Z}} \sum_{m = 0}^{\infty} a_{m+k} \overline{a}_m z^k$

where I used the convention that $a_k=0$ for all $k < 0$. Now the sequence $a$ is square summable and so it is an element of the Hilbert space $\ell^2$. Let $T: \ell^2 \rightarrow \ell^2$ be the left shift operator, so $(T^ka)_m = a_{m+k}$ for all $k \geq 0$ and indices $m$. Then the Laurent series above can be rewritten as

$\displaystyle (\sum_{k=1}^{\infty} \langle a, T^ka \rangle z^{-k}) + ||a||^2 + (\sum_{k=1}^{\infty} \langle T^ka, a \rangle z^k)$

where $\langle \cdot, \cdot \rangle$ denotes the sesquilinear form on $\ell^2$. Note that by the Cauchy-Schwarz inequality $|\langle T^ka, a \rangle| \leq ||T^ka|| \cdot ||a|| \leq ||a||^2$. Now define a holomorphic function $f_g$ on a neighborhood of the closed disc $\overline{\Delta}$ by

$f_g(z) = ||a||^2 + 2 \displaystyle \sum_{k=1}^{\infty} \langle T^ka, a \rangle z^k$.

Then for all $z \in \Gamma$ we have $\Re f_g(z) = |g(z)|^2 \geq 0$ and therefore $\Re f_g(z) > 0$ on the disc $\Delta$. The association $g \mapsto f_g$ is not injective however, since $g_1$ and $g_2$ map to the same function if $|g_1(z)|^2 = |g_2(z)|^2$ on the unit circle $\Gamma$. In this case, both $g_1$ and $g_2$ are equal to some function $g$ without zeroes on the disc $\Delta$ multiplied by appropriate finite Blaschke products. This function $g$ is then unique up to multiplication with a constant on the unit circle.
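The construction $g \mapsto f_g$ can be checked numerically for a polynomial $g$, where all the sums are finite. The sketch below is my own, with an arbitrarily chosen coefficient sequence `a`.

```python
import cmath

# Numerical check of g -> f_g for a polynomial g, whose Taylor
# coefficients are held in the list a (an arbitrary example).

a = [1.0, 0.5 - 0.5j, 0.25j]           # g(z) = 1 + (0.5-0.5j)z + 0.25j z^2

def g(z):
    return sum(c * z**k for k, c in enumerate(a))

def shift_inner(k):
    """<T^k a, a> = sum_m a_{m+k} * conjugate(a_m) for the left shift T."""
    return sum(a[m + k] * a[m].conjugate() for m in range(len(a) - k))

def f_g(z):
    """f_g(z) = ||a||^2 + 2 * sum_{k >= 1} <T^k a, a> z^k."""
    norm_sq = sum(abs(c)**2 for c in a)
    return norm_sq + 2 * sum(shift_inner(k) * z**k for k in range(1, len(a)))

# On the unit circle: Re f_g(z) = |g(z)|^2 >= 0.
for theta in (0.0, 1.0, 2.5):
    z = cmath.exp(1j * theta)
    assert abs(f_g(z).real - abs(g(z))**2) < 1e-12
# Inside the disc the real part is positive.
assert f_g(0.5).real > 0
```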

## Polynomial residues

If p is a prime then $m^p \equiv m$ (mod p) for all integers m. So the polynomial $f = X^p-X \in \mathbb{Z}[X]$ has the property that $f(m) \equiv 0$ (mod p) for all integers m. Since $\mathbb{Z}_p$ is an integral domain, any polynomial in $\mathbb{Z}[X]$ of degree d that is not divisible by p has at most d roots in $\mathbb{Z}_p$. The degree of a polynomial that vanishes identically on $\mathbb{Z}_p$ must therefore be at least p, so $X^p-X$ is also minimal in that sense. What can we say if we replace the prime p by just any positive integer n? In other words, what can we say about $f \in \mathbb{Z}[X]$ if $f(m) \equiv 0$ (mod n) for all integers m? This question was studied by Aubrey Kempner in the 1920s. Here are some of his results:

“Suppose $f \in \mathbb{Z}[X]$ has degree d and vanishes identically on $\mathbb{Z}_n$. Let $\dfrac{n}{d!} = \dfrac{m}{q}$ for coprime integers m and q. Then f is divisible by m.”

In particular if the coefficients of f have no common factor then n must divide d!, the factorial of its degree. For prime n this implies that f is divisible by n unless d is at least n, in accordance with the first paragraph above. To see that this bound is sharp for any n define the binomial polynomials $\binom{x}{d}$ by

$\displaystyle \binom{x}{d} = \dfrac{x (x-1) \dotsc (x-d+1)}{d!}$.

The binomial polynomials take integer values at integer arguments (the number of ways to choose d elements from a set of x). Therefore, if n divides d! then $d! \binom{x}{d}$ is a monic polynomial of degree d in $\mathbb{Z}[X]$ that vanishes identically on $\mathbb{Z}_n$. The other way around, suppose $f \in \mathbb{Z}[X]$ has degree d and $f(m) \equiv 0$ (mod n) for all integers m. Let $\dfrac{m}{q} = \dfrac{n}{d!}$ as before. Then there exist coefficients $c_0, \dotsc, c_d \in \mathbb{Z}$ such that

$f(x) = n \displaystyle \sum_{k=0}^d c_k \binom{x}{k} = \frac{m}{q} \sum_{k=0}^d c_k d! \binom{x}{k}$.

For each k the polynomial $c_k d! \binom{x}{k}$ has integer coefficients and since m and q are coprime the equation above implies that m divides f.
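Kempner's bound is easy to probe numerically. The sketch below (my own check, not from Kempner's paper) verifies that the monic polynomial $d! \binom{x}{d} = x(x-1) \dotsc (x-d+1)$ vanishes identically modulo every $n$ dividing $d!$, and revisits the prime case from the opening paragraph.

```python
# The falling factorial d! * binom(x, d) = x(x-1)...(x-d+1) is a
# product of d consecutive integers, hence divisible by d!, so it
# vanishes identically modulo every n dividing d!.

def falling_factorial(x, d):
    result = 1
    for j in range(d):
        result *= (x - j)
    return result

d = 4                                # d! = 24
for n in (2, 3, 4, 6, 8, 12, 24):   # all divisors of 24
    assert all(falling_factorial(m, d) % n == 0 for m in range(-50, 50))

# For a prime p, the polynomial X^p - X from the opening paragraph
# vanishes identically modulo p by Fermat's little theorem.
p = 7
assert all((m**p - m) % p == 0 for m in range(-50, 50))
```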

## A short alternative to Gordan

The nicely named “theorem of the alternative” of Paul Gordan states the following

“If $V \subset \mathbb{R}^n$ is a linear subspace then either $V$ contains a vector with positive coordinates or the orthogonal complement $V^{\perp}$ contains a non-zero vector with non-negative coordinates.”

These options are mutually exclusive, hence the name of the theorem. Gordan gave a proof of this in 1873 which involved a very clever inductive argument. I just thought of a self-contained, very short proof of this theorem. In fact I will prove an equivalent statement:

“If $v_1, \dotsc, v_n \in \mathbb{R}^k$ then either there is a vector $v \in \mathbb{R}^k$ such that the inner product $\langle v, v_j \rangle$ is positive for all $j \in \{1, \dotsc, n\}$ or the convex hull $C$ of $v_1, \dotsc, v_n$ contains the origin.”

The equivalence can be seen as follows. Let $M$ be the $n \times k$ matrix with $v_j$ as its rows. Then the columns of $M$ span a subspace $V \subset \mathbb{R}^n$ and this transforms the second statement into the first and vice versa.

Now let $v \in C$ have minimal norm. If $|v| = 0$ then $C$ contains the origin. If $|v| > 0$ and $w$ is any point in the convex hull then the line segment from $v$ to $w$ lies in $C$ and contains no point with smaller norm than $v$. So for all $t \in (0,1]$ we have

$\dfrac{|(1-t) v + t w|^2 - |v|^2}{2t} = \langle v, w \rangle - |v|^2+ \dfrac{t}{2}|w-v|^2 \geq 0$.

This implies that $\langle v, w \rangle \geq |v|^2 > 0$ and in particular this holds for all $w \in \{v_1, \dotsc, v_n \}$. Done!
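The proof is constructive enough to run. Starting from a vertex and repeatedly moving toward the vertex $w$ with the smallest inner product $\langle v, w \rangle$, using the minimizing $t$ from the inequality above, drives $v$ to the point of minimal norm in $C$. This is essentially Gilbert's algorithm; the sketch below is my own illustration rather than Gordan's argument.

```python
# Approximate the minimum-norm point of the convex hull of `vectors`
# by exact line search along segments to the "worst" vertex
# (essentially Gilbert's algorithm / Frank-Wolfe on |v|^2).

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def min_norm_point(vectors, iterations=10000):
    v = list(vectors[0])
    for _ in range(iterations):
        # Vertex with the smallest inner product against the candidate.
        w = min(vectors, key=lambda u: dot(v, u))
        d = [a - b for a, b in zip(v, w)]        # d = v - w
        dd = dot(d, d)
        if dd == 0:
            break
        # t minimizing |(1-t)v + tw|^2, clipped to the segment [0, 1].
        t = max(0.0, min(1.0, dot(v, d) / dd))
        v = [(1 - t) * a + t * b for a, b in zip(v, w)]
    return v

# Case 1: a separating vector exists.
v = min_norm_point([(1, 0), (0, 1)])
assert all(dot(v, u) > 0 for u in [(1, 0), (0, 1)])

# Case 2: the convex hull contains the origin.
v = min_norm_point([(1, 0), (-1, 1), (-1, -1)])
assert dot(v, v) < 1e-2
```

Depending on which case of the alternative holds, the limit point either separates (all inner products positive) or certifies that the origin lies in the hull.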