| |
|
|
(September 2006)
My daily routine - as a developer writing code for a living -
involves visits to sites like Slashdot,
Reddit and
Hacker News.
These visits usually end up with me downloading "stuff"
that I can take back home for... research. I have
a directory set-up in my PC (at work) for these "R&D"
materials, and periodically empty it to a rewritable DVD I carry
back and forth between work and home (Edit, April 2011:
DVDs are now all but obsolete; USB sticks are used in
their place, but the article is still valid - keep reading).
After a busy month of hacking, I realized that a lot of "R&D"
material was there; and decided to burn a DVD and take
it home. Only to find that more than 6GB were patiently waiting
in the research folder...
"Oh well, I'll just take the files in two passes - one today
and one tomorrow."
And at that point, the question formed in my mind:
What would be the optimal way of doing that,
filling up a DVD with a myriad files, making sure that it gets
filled up as much as possible? With no free space left unused?
Think about it. When you have more than a hundred files, there
are literally millions of combinations you can try. Which one is
the best?
Is it really millions? Well, here's a quick "brute-force" attempt:
import sys
g_maxSize=4480
def main():
collectionsToTest = []
for newNo in sys.stdin.readlines():
try:
newNo = int(newNo)
except:
print "%s is not numeric" % newNo
continue
newWing = [x+[newNo] for x in collectionsToTest]
collectionsToTest.append([newNo])
collectionsToTest.extend(newWing)
maximumLessThanMaxSize = 0
for set in collectionsToTest:
total = 0
for elem in set:
total += elem
if total > g_maxSize:
continue
if total > maximumLessThanMaxSize:
maximumLessThanMaxSize = total
print "Set:", set, "maximum:", \
maximumLessThanMaxSize
if __name__ == "__main__":
main()
|
main is split in two blocks. The first one
creates all possible sets you can form out of a series of
numbers, that are read from standard input. The second one checks each of
these sets to find the best set (the one that gets as close as
possible to 4480MB).
Each new number that is input increases the sets to test: the
new collectionsToTest is made of:
- the previous collection of sets
- the new element, in a single number set
- each one of the sets of the previous collection augmented
to include the new number
So, if you input numbers 4400,10,50,70, you get the
following collectionsToTest:
[4400] (4400 is input)
[10] (10 is input)
[4400, 10]
[50] (50 is input)
[4400, 50]
[10, 50]
[4400, 10, 50]
[70] (70 is input)
[4400, 70]
[10, 70]
[4400, 10, 70]
[50, 70]
[4400, 50, 70]
[10, 50, 70]
[4400, 10, 50, 70]
|
..and the following result from the prototype:
bash$ ./brutus.py
4400
10
50
70
(Pressing Ctrl-D)
Set: [4400] maximum: 4400
Set: [4400, 10] maximum: 4410
Set: [4400, 50] maximum: 4450
Set: [4400, 10, 50] maximum: 4460
Set: [4400, 70] maximum: 4470
Set: [4400, 10, 70] maximum: 4480
|
Now let's revisit the set creation: each new number that we input to a
previous collection of N sets, is adding a set of one element
(with the new member) and N new sets (the previous sets,
augmented to include the new number). This means that we go from
a collection of N sets to one with N+1+N sets. The work to perform therefore,
as we add elements, grows at a rate faster than 2^N; REALLY fast...
After just 20 numbers,
we have millions of combinations to test:
bash$ perl -e \
'for($i=1; $i<20; $i++) { print $i."\n"; }' | \
./brutus.py
(you'll run out of memory, you better Ctrl-C
while you're ahead)
|
My "R&D" directory has hundreds of
files, not just 20... we're in the area of billions, trillions,
zillions of combinations...
How do we cope?
Dynamic programmingIf you 've never heard of dynamic
programming and look it up in Wikipedia, you might get scared
from its technical description. You shouldn't. Its actually quite
easier to figure out what it is when you look at it from our -
somewhat limited - viewpoint.
Our problem (optimally choosing a subset of integers from a
given set to fill up a fixed size container) is one of a class of
problems that share a common trait: they are difficult to solve
in brute-force, but they allow you to easily find an optimal
solution for problem "N+1" if you know the solution for problem
"N".
Let's look at it in detail for our problem. We'll build a
table:
| Dynamic programming solver
for my "R&D" directory migration |
| |
1 |
2 |
3 |
... |
198 |
| 1 MB |
|
|
|
|
|
| 2 MB |
|
|
|
|
|
| 3 MB |
|
|
|
|
|
| ... |
| 4480 MB |
|
|
|
|
|
The row index provides the container size: we start from 1MB
and move all the way up to 4480MB. The column index is the
number of files we use from my directory to fill up the
container. I currently have a total of 198 files in the
directory, so the table has 198 columns.
The table stores in each cell the optimal (maximum) sum
of filesizes that we can achieve for each cell - that is, for
each combination of (container size, first j files). Notice that we have
arranged the files in a particular order, so
- in column j, we'll store the optimal result using only the
first j files
- in column j+1, the optimal result using only the first j+1
files
- etc...
This means that if we somehow manage to fill this table, then
the cell at the bottom right will contain
the optimal (maximum) sum we can get for our set of 198 files. To be more precise:
the value in cell (N,M) is the optimal cumulative size that we can get - that is, the
sum of the file sizes of the optimal set, for a container of size
N using the first M files - so it will always be less than (or
equal) to the row index, N (which is the size of the container).
Now imagine that we have filled up ALL lines up to line j, and
ALL cells of line j+1 up to column k. How do we fill cell (j+1,
k+1)?
| |
Column k |
Column k+1 |
| Line j |
a |
b |
| Line j+1 |
c |
? |
| ... |
That is, given that...
- line j provides us with the optimal (maximum) cumulative
filesize for a container of size j if we use k files ("a")
- line j provides us with the optimal (maximum) cumulative
filesize for a container of size j if we use k+1 files ("b")
- and that we have somehow managed to figure out the optimal
cumulative filesize for a container of size j+1 by using the first k
files ("c")
...how do we fare with the extra megabyte provided by line
j+1 if we also use file k+1? Can we pack "more stuff", and thus
increase the value from "c" to something bigger?
In order to fill cell (j+1,k+1) of line j+1, we need to see
whether the optimal sum of filesizes is the same as it was using
the k first files (i.e. the same as in cell (j+1,k), "c"), or whether
we can use the extra file (file k+1) and increase space coverage
by adding it to the mix.
To do that...
- we first check to see if file k+1 fits in a container
of size j+1. If not, we copy "c" from the cell on the left - that is,
the optimal result stays the same in this line between columns
k and k+1, since file k+1 can't fit in j+1 MB.
- if it can fit, we check if we can increase the space
covered by using the file k+1 (compared to column k, which uses
only the first k files and achieves "c"). This is the most complicated part:
If we use file k+1 in the container of size j+1, then the remaining
space will be j+1-filesizeOfFile(k+1). The optimal result we can
achieve for a container of that size, is known - we can simply look it up
in the table, since it is stored in a line above (remember, ALL lines
up to j+1 are filled-in); we then add filesizeOfFile(k+1) to that result, and check whether we
increased the coverage (i.e. whether we achieved a value greater than "c").
If we can't improve the optimal result of the previous
column (by using file k+1), we just copy the value from the cell on the left
(i.e. we use only the first k files).
import sys
g_maxSize=4480
def main():
fileSizes = []
for key in sys.stdin.readlines():
try:
key = int(key)
fileSizes.append(key)
except:
print "%s is not numeric" % key
continue
print "Total of ", len(fileSizes), "items"
print fileSizes
optimalResults = {}
for containerSize in xrange(0, g_maxSize+1):
for idx,fileSize in enumerate(fileSizes):
cellCurrent = (containerSize, idx)
cellOnTheLeftOfCurrent = (containerSize, idx-1)
if containerSize<fileSize:
optimalResults[cellCurrent] = \
optimalResults.get(cellOnTheLeftOfCurrent,0)
else:
remainingSpace = containerSize - fileSize
optimalResultOfRemainingSpace = \
optimalResults.get((remainingSpace, idx-1),0)
if optimalResultOfRemainingSpace + fileSize > \
optimalResults.get(cellOnTheLeftOfCurrent,0):
optimalResults[cellCurrent] = \
optimalResultOfRemainingSpace + fileSize
else:
optimalResults[cellCurrent] = \
optimalResults[cellOnTheLeftOfCurrent]
print "Attainable:", optimalResults[(g_maxSize, len(fileSizes)-1)]
if __name__ == "__main__":
main()
|
bash$ ./dynprog1.py
4400
10
50
70
(Pressing Ctrl-D)
Total of 4 items
[4400, 10, 50, 70]
Attainable: 4480
|
We've found the optimal result (4480) - but how do
we get to it? What series of files led us to it?
Simple - we augment the algorithm to not only keep the optimal sum per cell, but also the last
step we used to get it. To reproduce the steps that get us
to the final (optimal) result, we start from the optimal result, and go backwards, subtracting the
last "step" we used to reach each cell:
optimalResults = {}
lastStep = {}
...
if containerSize<fileSize:
optimalResults[cellCurrent] = \
optimalResults.get(cellOnTheLeftOfCurrent,0)
lastStep[cellCurrent] = \
lastStep.get(cellOnTheLeftOfCurrent,0)
else:
...
if optimalResultOfRemainingSpace + fileSize > \
optimalResults.get(cellOnTheLeftOfCurrent,0):
optimalResults[cellCurrent] = \
optimalResultOfRemainingSpace + fileSize
lastStep[cellCurrent] = fileSize
else:
optimalResults[cellCurrent] = \
optimalResults[cellOnTheLeftOfCurrent]
lastStep[cellCurrent] = \
lastStep.get(cellOnTheLeftOfCurrent,0)
...
total = optimalResult[(g_maxSize, len(fileSizes)-1)]
while total>0:
lastFileSize = lastStep[(total, len(fileSizes)-1)]
print total, "removing", lastFileSize
total -= lastFileSize
|
bash$ ./dynprog2.py
4400
10
50
70
(Pressing Ctrl-D)
Total of 4 items
[4400, 10, 50, 70]
Attainable: 4480
4480 removing 70
4410 removing 10
4400 removing 4400
|
And now we can enjoy the real fruits of our labour: feeding the
brute force implementation with the numbers 1-99 would lead to more
than 2^100 sets - the universe would end while we waited for results
to appear... Instead, our dynamic programming script only takes a couple of
seconds:(Note that numbers from 1 to 99
add up to 99*100/2 = 4950, so they accumulate to well above our
target of 4480 - and the optimal sum of 4480 is achieved by dropping
some files out, just as we wanted):
bash$ perl -e 'for($i=1; $i<100; $i++) { print $i."\n"; }' | \
./dynprog2.py
Total of 99 items
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,
18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32,
33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47,
48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62,
63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77,
78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92,
93, 94, 95, 96, 97, 98, 99]
Attainable: 4480
4480 removing 95
4385 removing 94
4291 removing 93
4198 removing 92
4106 removing 91
4015 removing 90
...
...
21 removing 6
15 removing 5
10 removing 4
6 removing 3
3 removing 2
1 removing 1
|
It works. We can now optimally fill storage space.
You can download an easy to use version here.
Usage is as plain as it gets:
bash$ fitToSize.py inputDirectory/ outputDirectory/ 4475
(you can choose the desired capacity, 4475 is just an example)
P.S. Using MBs to count is in fact cheating: the correct
solution is to use the allocation block of the filesystem as a
step size between lines. Well, it's close enough for my needs
(using sectors - 512 bytes - would make the table taller by a factor
of 2048x, and thus the execution would slow down by that same factor).
Edit, May 2011: An interesting
discussion
arose when I tried to implement
this algorithm in F# - F# turned out to be slower than Python!
|
|
|
|