What would you consider the "keywords" for Lisp; the special forms? That seems somewhat arbitrary: for example, Common Lisp specifies that IF is a special form and COND is a macro, but it could easily have made the opposite choice, and only the small number of programs that analyze other programs would need to be any different.

Andy

On Mon, Jun 16, 2014 at 10:33 PM, Warren D Smith <warren.wds@gmail.com> wrote:
I got some large programs in a few languages and counted how often various words appeared. Sometimes I cheated a bit.
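A minimal sketch of this kind of count, in Python. It is not the method actually used for the figures in this message: it assumes whole-identifier, case-insensitive matching against a hypothetical `KEYWORDS` set, and it makes no attempt to skip comments or string literals.

```python
import re
from collections import Counter

# Hypothetical keyword list; extend it for the language being measured.
KEYWORDS = {"if", "else", "for", "while", "return", "goto", "case", "do"}

def count_words(source, keywords=KEYWORDS):
    """Count case-insensitive occurrences of each keyword as a whole identifier."""
    # Split the text into identifier-like tokens, so "ifdef" is not counted as "if".
    words = re.findall(r"[A-Za-z_][A-Za-z0-9_]*", source)
    return Counter(w.lower() for w in words if w.lower() in keywords)

# Made-up sample input, not taken from any of the programs listed here.
sample = "if (x) { return 1; } else { for (;;) if (y) goto out; }"
counts = count_words(sample)
for word, n in counts.most_common():
    print(f"{word.upper()}={n}")
```

Matching substrings instead of whole identifiers would inflate some counts, which may be what happened with entries such as STR in the GCC figures.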
QUESTION: I would appreciate it if anybody could supply corresponding data for some large programs in some other languages such as LISP, ML, or whatever. Here are my results:
PASCAL chess0.5 by P.Frey+L.Atkin 1978: ":="=666, END=316, BEGIN=287, IF=205, THEN=205, DO=140, PROCEDURE=115, OF=107, VAR=80, ARRAY=74, ELSE=67, TO=63, FOR=66, FALSE=55, TRUE=48, WHILE=43, GOTO=42, WRITE=39, WRITELN=37, AND=32, NOT=32, WITH=31, CASE=27, FUNCTION=21, ABS=14, LABEL=11, DIV=9, DOWNTO=7, PACKED=6, RECORD=6, UNTIL=5, TYPE=2, NAME=2, ONLY=1, CONST=1.
C++ Gull-II chess program by Vadim Demichev: " = "=2189, IF=1409, "{"=1006, INT=498, DEFINE=362, ELSE=327, RETURN=282, FOR=260, UINT=205, DO=168, LSB=203, VOID=81, CONST=80, BOOL=78, TEMPLATE=72, GOTO=69, FPRINTF=65, ENDIF=63, STDOUT=59, MIN=58, CONTINUE=57, IFDEF=56, SIZEOF=51, DO=49, CHAR=41, BREAK=32, UNDEF=30, POPCOUNT=28, VOLATILE=26, INLINE=22, WHILE=20, MEMSET=19, STRCMP=18, FILE=18, DOUBLE=15, TYPEDEF=15, STRUCT=14, CASE=14, UNSIGNED=13, RAND=12, DEFAULT=8, MAX=7, ENUM=4.
C++ Senpai1.0 chess program by Fabien Letouzey: " = "=1299, INT=1293, "{"=1188, IF=441, RETURN=426, CONST=308, ASSERT=286, FOR=206, VOID=179, STD=159, BOOL=138, FILE=131, FALSE=128, ELSE=119, TRUE=93, POPCOUNT=59, UINT=45, LSB=43, STATIC=27, CASE=26, MAX=26, BREAK=22, MIN=21, RAND=21, CHAR=15, DOUBLE=19, WHILE=18, ENUM=14, STRUCT=13, TYPEDEF=12, VOLATILE=11, SIGNED=11, CONTINUE=8, DEFAULT=7, SIZEOF=6, DEFINE=6.
C Bzip2 de/compress program by Julian Seward: " = "=1144, IF=701, "{"=573, INT=451, RETURN=276, DEFINE=218, FOR=208, UCHAR=151, VOID=134, UINT=134, BREAK=110, FILE=98, ELSE=96, WHILE=94, STDERR=89, FPRINTF=86, STATIC=83, TRUE=82, CHAR=80, CASE=79, GOTO=46, BOOL=45, MAX=41, CONTINUE=40, UNDEF=35, ENDIF=35, UNSIGNED=32, RAND=32, SIZEOF=27, INCLUDE=26, QSORT=24, ASSERT=24, STDOUT=21, DO=21, FCLOSE=19, EOF=19, FOPEN=15, TYPEDEF=15, STDIN=15, CONST=13, STRLEN=10.
C: Gcc C-compiler (now dead version circa 2004): " = "=83693, "{"=59732, IF=54646, RETURN=31658, RAND=29586, CASE=28527, FOR=26467, STR=22012 (count includes many functions), ELSE=21005, CONST=16284, GOTO=15923, INT=13078, VOID=12765, STATIC=10312, DEFINE=10004, BREAK=8035, CHAR=7578, ENDIF=6755, UNSIGNED=6512, MIN=4374, ENUM=4181, TRUE=3328, DO=3271, PRINTF=2884, DEFAULT=2519, FALSE=2366, BOOL=2335, MAX=2278, SWITCH=2215, WHILE=2177, SIZEOF=1924, INLINE=1552, FILE=1302, CONTINUE=1301, UNDEF=1159, FLOAT=986, INCLUDE=845, CLEAR=687, UNION=669, SIGNED=656, DOUBLE=621, TYPEDEF=513, VOLATILE=430, STDERR=400, SWAP=346, MEMSET=335, UCHAR=278, SORT=275, UINT=191, ASSERT=129, POPCOUNT=46, STDOUT=21.
Fortran: 140 programs collected by Don Knuth 1971, I just repeat his counts from 1st column of table 1 in http://www.cs.tufts.edu/~nr/cs257/archive/don-knuth/empirical-fortran.pdf : (Assignment)=41%, IF=14.5%, GOTO=13, CALL=8, CONTINUE=5, WRITE=4, FORMAT=4, DO=4, DATA=2, RETURN=2, DIMENSION=2, COMMON=1.5, END=1, BUFFER=1, SUBROUTINE=1, REWIND=1.
Others: Christian S. Collberg, Ginger Myles, Michael Stepp: An empirical study of Java bytecode programs, Software Practice and Experience 37,6 (2007) 581-641. http://goto.ucsd.edu/~mstepp/publications/empirical.pdf They got 1132 Java programs in bytecode form, not source form, and counted the bytecode frequencies plus much more. One interesting finding is that, if you look at floating point constants in programs, 44% of them are in {0, 1, 0.5, 2.0}, and you can boost 44% up to 56% if you also put in {255.0 100.0 -1.0 4.0 5.0 -inf 10.0 0.9 0.75 1000.0 64.0 3.0 pi NaN 20.0 4.0 90.0 0.25 8.0 180.0 2*pi 0.1 6.0 HUGE 360.0 1.0e-4 -2.0 pi/2 sqrt(0.5) +inf }, thus proving the real numbers have a lot less entropy than you thought :)
Michael D. Ernst, Greg J. Badros, and David Notkin: An empirical analysis of C preprocessor use, IEEE Transactions on Software Engineering 28,12 (2002) 1146-1170. http://homes.cs.washington.edu/~mernst/pubs/c-preprocessor-tse2002.pdf
Robert P. Cook & Insup Lee: A contextual analysis of Pascal programs, Software: Practice and Experience 12,2 (February 1982) 195-203 [can anybody supply PDF?] examined 264 Pascal programs, partial info here: http://warriors.eecs.umich.edu/old_schedules/Readings/2001-03-28-Pascal.pdf
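The floating point constant figures quoted above from Collberg et al. are a coverage measure: the fraction of all constant occurrences accounted for by a small set of values. A rough Python sketch of that calculation follows; the function name `coverage` and the sample list are made up for illustration, and it takes the k most frequent values rather than the fixed sets listed in the paper.

```python
from collections import Counter

def coverage(constants, k):
    """Fraction of all occurrences covered by the k most frequent values."""
    if not constants:
        return 0.0
    counts = Counter(constants)
    covered = sum(n for _, n in counts.most_common(k))
    return covered / len(constants)

# Made-up sample data, not taken from the paper.
sample_constants = [0.0, 1.0, 0.0, 2.0, 0.5, 3.7, 1.0, 0.0]
print(coverage(sample_constants, 2))
```

NaN would need special handling in a real count, since separately created NaN values do not compare equal to each other.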
-- Andy.Latto@pobox.com