highest is a software that finds the highest numbers in a sequence of lines containing numbers.
Motivation:
* When a filesystem fills up, a System Administrator is likely to want to get a report listing the 100 (or so) largest files in the filesystem. This won't always pinpoint the source of the problem, but commonly will. This is often done using a command like:
o find /mntpt -type f -print0 | xargs -0 du -k | sort -nr | head -100
* highest is a faster way of doing that same thing. EG:
o find /mntpt -type f -print0 | xargs -0 du -k | highest -n 100 --use-heap
* Generally speaking, when a filesystem fills up, programs start messing up some files by writing and not checking error returns, or by not using temporary files prior to destroying the old copy of a file. So when a filesystem fills up, it's good to fix it ASAP.
* Also, on truly large filesystems like GFS or Lustre, you may have so many files that GNU sort starts using temporary files to do the sorting, slowing things down and possibly even compounding the problem you're trying to fix; highest has no such issues.
* Also, highest just seemed like a natural place to put my treap module for testing the treap implementation I did in Python/Cython. Although a heap seems like a more natural fit (a heap should use a bit less memory), the treap actually seems to outperform the heap. If you have no other use for treaps, using a treap with highest may not give enough of a performance boost to justify the time to install the module. EG:
o find /mntpt -type f -print0 | xargs -0 du -k | highest -n 100 --use-treap
Usage:
dstromberg@benchbox:~/src/svn/svn/highest/trunk$ ./highest --help
Usage: ./highest [-n numtokeep] [-e every] [-f fieldno] [-a] [-r] [-h] [--filename fn]
-n n specifies the number of numbers to keep
-e n says to only look at 1 in every n values
-f n says to compare on the nth field
-a says to sort in ASCII order, rather than numerically
-r says to sort in reverse
-s sent says if a field cannot be converted to a float,
to use sentinel value "sent" instead when sorting
-h says to give usage information (help)
--disallow-duplicates (or just -d) says to only keep one value of a given size
--filename fn says to read from file fn
--bufsize n says to read using a buffer size of n; only meaningful with --filename
--use-bisect
--use-treap
--use-heap
Reads from stdin by default, but that's considerably slower than --filename /dev/stdin
Product's homepage