[math-fun] keyword frequencies in programming languages
I got some large programs in a few languages and counted how often various words appeared. Sometimes I cheated a bit.
QUESTION: I would appreciate it if anybody could supply corresponding data for some large programs in some other languages such as LISP, ML, or whatever. Here are my results:
PASCAL chess0.5 by P. Frey + L. Atkin 1978: ":="=666, END=316, BEGIN=287, IF=205, THEN=205, DO=140, PROCEDURE=115, OF=107, VAR=80, ARRAY=74, ELSE=67, TO=63, FOR=66, FALSE=55, TRUE=48, WHILE=43, GOTO=42, WRITE=39, WRITELN=37, AND=32, NOT=32, WITH=31, CASE=27, FUNCTION=21, ABS=14, LABEL=11, DIV=9, DOWNTO=7, PACKED=6, RECORD=6, UNTIL=5, TYPE=2, NAME=2, ONLY=1, CONST=1.
C++ Gull-II chess program by Vadim Demichev: " = "=2189, IF=1409, "{"=1006, INT=498, DEFINE=362, ELSE=327, RETURN=282, FOR=260, UINT=205, DO=168, LSB=203, VOID=81, CONST=80, BOOL=78, TEMPLATE=72, GOTO=69, FPRINTF=65, ENDIF=63, STDOUT=59, MIN=58, CONTINUE=57, IFDEF=56, SIZEOF=51, DO=49, CHAR=41, BREAK=32, UNDEF=30, POPCOUNT=28, VOLATILE=26, INLINE=22, WHILE=20, MEMSET=19, STRCMP=18, FILE=18, DOUBLE=15, TYPEDEF=15, STRUCT=14, CASE=14, UNSIGNED=13, RAND=12, DEFAULT=8, MAX=7, ENUM=4.
C++ Senpai1.0 chess program by Fabien Letouzey: " = "=1299, INT=1293, "{"=1188, IF=441, RETURN=426, CONST=308, ASSERT=286, FOR=206, VOID=179, STD=159, BOOL=138, FILE=131, FALSE=128, ELSE=119, TRUE=93, POPCOUNT=59, UINT=45, LSB=43, STATIC=27, CASE=26, MAX=26, BREAK=22, MIN=21, RAND=21, CHAR=15, DOUBLE=19, WHILE=18, ENUM=14, STRUCT=13, TYPEDEF=12, VOLATILE=11, SIGNED=11, CONTINUE=8, DEFAULT=7, SIZEOF=6, DEFINE=6.
C Bzip2 de/compress program by Julian Seward: " = "=1144, IF=701, "{"=573, INT=451, RETURN=276, DEFINE=218, FOR=208, UCHAR=151, VOID=134, UINT=134, BREAK=110, FILE=98, ELSE=96, WHILE=94, STDERR=89, FPRINTF=86, STATIC=83, TRUE=82, CHAR=80, CASE=79, GOTO=46, BOOL=45, MAX=41, CONTINUE=40, UNDEF=35, ENDIF=35, UNSIGNED=32, RAND=32, SIZEOF=27, INCLUDE=26, QSORT=24, ASSERT=24, STDOUT=21, DO=21, FCLOSE=19, EOF=19, FOPEN=15, TYPEDEF=15, STDIN=15, CONST=13, STRLEN=10.
C: GCC C compiler (now dead version circa 2004): " = "=83693, "{"=59732, IF=54646, RETURN=31658, RAND=29586, CASE=28527, FOR=26467, STR=22012 (count includes many functions), ELSE=21005, CONST=16284, GOTO=15923, INT=13078, VOID=12765, STATIC=10312, DEFINE=10004, BREAK=8035, CHAR=7578, ENDIF=6755, UNSIGNED=6512, MIN=4374, ENUM=4181, TRUE=3328, DO=3271, PRINTF=2884, DEFAULT=2519, FALSE=2366, BOOL=2335, MAX=2278, SWITCH=2215, WHILE=2177, SIZEOF=1924, INLINE=1552, FILE=1302, CONTINUE=1301, UNDEF=1159, FLOAT=986, INCLUDE=845, CLEAR=687, UNION=669, SIGNED=656, DOUBLE=621, TYPEDEF=513, VOLATILE=430, STDERR=400, SWAP=346, MEMSET=335, UCHAR=278, SORT=275, UINT=191, ASSERT=129, POPCOUNT=46, STDOUT=21.
Fortran: 140 programs collected by Don Knuth 1971; I just repeat his counts from the 1st column of table 1 in http://www.cs.tufts.edu/~nr/cs257/archive/don-knuth/empirical-fortran.pdf : (Assignment)=41%, IF=14.5%, GOTO=13, CALL=8, CONTINUE=5, WRITE=4, FORMAT=4, DO=4, DATA=2, RETURN=2, DIMENSION=2, COMMON=1.5, END=1, BUFFER=1, SUBROUTINE=1, REWIND=1.
Others: Christian S. Collberg, Ginger Myles, Michael Stepp: An empirical study of Java bytecode programs, Software Practice and Experience 37,6 (2007) 581-641. http://goto.ucsd.edu/~mstepp/publications/empirical.pdf They got 1132 Java programs in bytecode, not source, form, and counted the bytecode frequencies plus much more. One interesting finding is that, if you look at floating-point constants in programs, 44% of them are in {0, 1, 0.5, 2.0}, and you can boost 44% up to 56% if you also put in {255.0 100.0 -1.0 4.0 5.0 -inf 10.0 0.9 0.75 1000.0 64.0 3.0 pi NaN 20.0 4.0 90.0 0.25 8.0 180.0 2*pi 0.1 6.0 HUGE 360.0 1.0e-4 -2.0 pi/2 sqrt(0.5) +inf }, thus proving the real numbers have a lot less entropy than you thought :)
Michael D. Ernst, Greg J. Badros, and David Notkin: An empirical analysis of C preprocessor use, IEEE Transactions on Software Engineering 28,12 (2002) 1146-1170. http://homes.cs.washington.edu/~mernst/pubs/c-preprocessor-tse2002.pdf
Robert P. Cook & Insup Lee: A contextual analysis of Pascal programs, Software: Practice and Experience 12,2 (February 1982) 195-203 [can anybody supply PDF?] examined 264 Pascal programs; partial info here: http://warriors.eecs.umich.edu/old_schedules/Readings/2001-03-28-Pascal.pdf
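The message does not say exactly how the tallies above were produced, so the following is only a rough sketch of one way to get comparable numbers. It assumes whole-word, case-insensitive matching for keywords and plain substring counting for literal tokens such as " = " and "{". The function name `keyword_counts` and the sample input are hypothetical.

```python
import re
from collections import Counter


def keyword_counts(source, keywords, literals=()):
    """Tally keywords (whole words, case-insensitive) and literal substrings.

    ASSUMPTION: whole-word matching; the original message does not state
    whether its counts were whole-word or substring based.
    """
    wanted = {k.upper() for k in keywords}
    # Identifier-like runs: a letter or underscore, then letters/digits/underscores.
    words = re.findall(r"[A-Za-z_][A-Za-z0-9_]*", source)
    counts = Counter(w.upper() for w in words if w.upper() in wanted)
    # Literal tokens are counted as non-overlapping substrings.
    for lit in literals:
        counts[lit] = source.count(lit)
    return counts


if __name__ == "__main__":
    sample = "if (x) { y = 1; } else { y = 2; }\nif (y) return y;\n"
    result = keyword_counts(sample, ["if", "else", "return"], [" = ", "{"])
    for token, n in result.most_common():
        print(f"{token}={n}")
```

Whole-word matching keeps IF from also counting IFDEF or ENDIF; a substring-based count would be higher for such tokens.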
What would you consider the "keywords" for Lisp: the special forms? That seems somewhat arbitrary. For example, Common Lisp specifies that IF is a special form and COND is a macro, but it could easily have made the opposite choice, and only the small number of programs that analyze other programs would need to be any different.
Andy
In reply to Warren D Smith <warren.wds@gmail.com>, Mon, Jun 16, 2014 at 10:33 PM.
_______________________________________________ math-fun mailing list math-fun@mailman.xmission.com https://mailman.xmission.com/cgi-bin/mailman/listinfo/math-fun
-- Andy.Latto@pobox.com
Whitfield Diffie (private communication) concatenated all the files in the LISP directory of GNU Emacs 21.1.1, removed comments (mostly), replaced every character other than letters (a-z, A-Z), numerals (0-9), hyphens, underscores, and colons by carriage returns, and finally ran sort | uniq -c | sort -nr. Here are the words with counts > 5000:
48739 nil
45024 setq
43767 if
36920 t
29249 1
28873 \
28675 defun
26355 \\
25941 and
25566 let
23723 the
20944 0
18370 or
17903 a
16801 B
16742 car
16071 not
14338 to
14150 list
13659 point
13405 eq
12234 cdr
11970 s
11274 is
11260 nth
10981 2
10976 when
10630 defvar
10593 of
9652 define-key
9646 file
9571 interactive
9421 while
9026 -
9009 goto-char
8985
8711 in
8646 optional
8121 for
7848 string
7788 :group
7458 x
7388 arg
7310 concat
7205 cons
6882 quote
6855 progn
6738 buffer
6615 insert
6523 save-excursion
6234 end
6164 :type
6035 error
5935 cond
5919 defcustom
5831 name
5793 message
5378 3
5345 autoload
5261 n
5248 \n
5063 format
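The procedure described in this message (strip comments, split on everything except a small character set, then count and sort) can be approximated in a few lines. This is a sketch under stated assumptions, not the commands actually used: the comment stripping is a crude guess at what "mostly" meant, and the names `strip_comments` and `word_counts` are hypothetical.

```python
import re
from collections import Counter


def strip_comments(text):
    # ASSUMPTION: a crude rule that drops everything from ';' to end of line.
    # It also clips a ';' inside a string literal, so it only "mostly" works.
    return re.sub(r";.*", "", text)


def word_counts(text):
    # Any run of characters other than letters, digits, hyphen, underscore,
    # and colon acts as a separator, mirroring the replacement step above.
    words = re.split(r"[^A-Za-z0-9_:-]+", text)
    return Counter(w for w in words if w)


if __name__ == "__main__":
    sample = "(defun f (x) ; doc\n  (if x (setq y nil) nil))\n"
    counts = word_counts(strip_comments(sample))
    # Same layout as `sort | uniq -c | sort -nr`: count first, most frequent first.
    for word, n in counts.most_common():
        print(f"{n:7d} {word}")
```

With this separator set, a name such as `save-excursion` or `:group` stays in one piece, while parentheses and quotes disappear entirely.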
Such a table fails to distinguish between programmer-defined items and ones that are really part of the language core. I get the impression you are interested in the latter, although I confess that I'm not sure what the agenda is. Are we supposed to learn something about the languages in question? What?
All Lisp dialects have an intentionally vague boundary between terms that are defined by the language and those defined by the programmer. Another way to put this is that all Lisp programs are Lisp language extensions. I would argue, for example, that "point", which features prominently in the list above, reveals something about the application (Emacs) rather than the language (Emacs Lisp). "Save-excursion" -- even more so. In Lisp, there is no syntactic clue that infallibly singles out language primitives.
What are we trying to find out? Perhaps there is a better way than just counting lemmas; or perhaps, for Lisp at least, we are asking the wrong question.
In reply to Warren D Smith <warren.wds@gmail.com>, Tue, Jun 17, 2014 at 7:36 PM.
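The distinction raised in this reply, between words the language defines and words a program defines, can only be applied mechanically once someone has chosen a "core" set by hand. The sketch below shows that step on a few counts copied from the Emacs table in this thread. The set `CORE` is an assumed, partial selection of Emacs Lisp special forms picked for illustration, and `split_by_core` is a hypothetical helper; where to draw the line is exactly the judgment call the reply describes.

```python
# A few counts copied from the Emacs Lisp table in this thread.
COUNTS = {
    "setq": 45024,
    "if": 43767,
    "point": 13659,
    "define-key": 9652,
    "goto-char": 9009,
    "cond": 5935,
}

# ASSUMPTION: a hand-picked, partial set of Emacs Lisp special forms.
# It is illustrative only and not a complete or authoritative list.
CORE = {"setq", "if", "cond", "quote", "progn", "and", "or", "let", "while"}


def split_by_core(counts, core):
    """Partition a word-count table into (in core set, everything else)."""
    in_core = {w: n for w, n in counts.items() if w in core}
    rest = {w: n for w, n in counts.items() if w not in core}
    return in_core, rest


if __name__ == "__main__":
    in_core, rest = split_by_core(COUNTS, CORE)
    print("core:", in_core)
    print("rest:", rest)
```

Changing `CORE` changes the answer, which is the point: the table alone cannot say which words belong to the language.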
Participants (3): Allan Wechsler, Andy Latto, Warren D Smith