Andy Latto wrote: You get the quantity -ln(p) of information from a message that occurs with probability p, but you only get that information with probability p, so the expected amount of information received is the sum of -p ln(p) over all possible messages you might receive.
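A minimal sketch of that expectation (my own illustration, not Andy's code), with `self_information` and `expected_information` as hypothetical helper names:

```python
import math

def self_information(p):
    """Information (in nats) carried by a message of probability p."""
    return -math.log(p)

def expected_information(probs):
    """Expected information: sum of -p*ln(p) over all messages."""
    return sum(p * self_information(p) for p in probs)

# A fair coin: two messages, each with probability 1/2,
# gives ln(2) nats on average.
print(expected_information([0.5, 0.5]))
```

For the fair coin this prints ln(2) ≈ 0.693, the entropy of one fair bit in nats.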
Andy, thanks for the clarification of the provenance of this formula. So this sum measures the average information delivered by the channel, rather than the information in a single message. But the maximum of such a sum isn't necessarily the maximum of the summand. Applying the above by viewing a number as a stream of digits: for a given base b we sum over the b possible "messages", each of which (presumably) occurs with equal probability p = 1/b, giving b * (-(1/b) ln(1/b)) = -ln(1/b) = ln(b). This says the larger the base, the higher the information per digit, not that the maximum occurs when b = e. What's wrong with this picture?
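To make the computation in question concrete, here is a small sketch (my own, with the hypothetical name `per_digit_entropy`) of the sum over the b equiprobable digit "messages"; it confirms the sum collapses to ln(b) and so grows monotonically with the base:

```python
import math

def per_digit_entropy(b):
    """Entropy of one digit in base b, assuming all b digits
    are equally likely, so each has probability p = 1/b."""
    p = 1.0 / b
    # Sum of -p*ln(p) over the b possible digits:
    return b * (-p * math.log(p))  # algebraically equal to ln(b)

for b in (2, 3, 8, 10, 16):
    print(b, per_digit_entropy(b))
```

As stated above, this per-digit entropy only increases with b; it never peaks at b = e, which is the apparent puzzle.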