Update: I’ve published this as a gem. Install it with gem install redis-scheduler or check out the GitHub page.
Want to use Redis to schedule work items? Zsets seem like a natural fit,
but they’re actually fairly tricky to use for this purpose. Unlike
lists, Redis provides no blocking or atomic move primitives for zsets.
Additionally, set semantics don’t allow for multiple schedulings of the
same item. These problems are surmountable, but only with some effort.
So here is a basic scheduler for Redis, in Ruby. I think it does
everything you’d want from a production scheduler:
- You can schedule arbitrary items at arbitrary times.
- It supports multiple simultaneous readers and writers.
- An exception causes the in-process item to be rescheduled at the original time.
- A crash leaves the item in an error queue, from which it can later be recovered.
- You can iterate over ready items in either blocking or non-blocking mode.
An example invocation:
r = Redis.new
s = RedisScheduler.new r, blocking: true
startt = Time.now
s.schedule! "a", startt + 10
s.schedule! "b", startt + 15
s.schedule! "c", startt + 20
s.each { |item, at| puts "#{Time.now - startt}: #{item}" }
will print:
10.03481788: a
15.05255288: b
20.06058172: c
... waits forever ...
Here’s the code:
class RedisScheduler
include Enumerable
POLL_DELAY = 1.0
CAS_DELAY = 0.5
def initialize redis, opts={}
@redis = redis
@namespace = opts[:namespace]
@blocking = opts[:blocking]
@queue = [@namespace, "q"].join
@error_queue = [@namespace, "errorq"].join
@counter = [@namespace, "counter"].join
end
def schedule! item, time
id = @redis.incr @counter
@redis.zadd @queue, time.to_f, "#{id}:#{item}"
end
def reset!
[@queue, @error_queue, @counter].each { |k| @redis.del k }
end
def each
while(x = get)
item, at = x
begin
yield item, at
rescue Exception
schedule! item, at
raise
ensure
cleanup! item
end
end
end
private
def get; @blocking ? blocking_get : nonblocking_get end
def blocking_get
sleep POLL_DELAY until(x = nonblocking_get)
x
end
class InvalidEntryException < StandardError; end
def nonblocking_get
catch :cas_retry do
@redis.watch @queue
item, at = @redis.zrangebyscore @queue, 0, Time.now.to_f,
:withscores => true, :limit => [0, 1]
if item
@redis.multi do
@redis.zrem @queue, item
@redis.lpush @error_queue, item
end or begin
sleep CAS_DELAY
throw :cas_retry
end
item =~ /^\d+:(\S+)$/ or raise InvalidEntryException, item
item = $1
[item, Time.at(at.to_f)]
end
end
end
def cleanup! item
@redis.lrem @error_queue, 1, item
end
end
If you find this useful, please drop me a note.
I’ve released Whistlepig 0.10. This release
changes the index format in a backwards-incompatible way. Indexes created by
pre-0.10 Whistlepigs will not be readable by Whistlepig 0.10.
The good news going is that I have also added proper versioning on the
index formats now, and so providing upgrade paths will be possible in
the future, should I find it necessary to introduce further
backwards-incompatible changes.
This change was necessary primarily for multi-process support. As of
Whistlepig 0.10, you can have multiple processes reading from, and
writing to, the same index. This means that Whistlepig may finally be
useful for e.g., Rails apps, where you have multiple server processes
all reading from the same index.
Whistlepig provides this multi-process concurrency by using pthread
read-write locks. From a performance perspective, locking is not a very
good form of concurrency. The original plan, way back when I was
designing Whistlepig, was to use lock-free algorithms to allow
concurrent access (at least, single-writer and multi-reader) and remain
performant. Unfortunately, the C memory model isn’t as robust as the
JVM memory model and, as far as I can tell, doesn’t support the
atomicity guarantees necessary to accomplish this (see e.g. this Quora
question and my follow-up
comment).
FWIW, it look like the C11
standard
may address these issues (Jens Gustedt has more on
this)
but that’s not going to help me today.
So, locking it it is, for now. The read-write locking mechanism at least
allows multiple reader processes to access the same segment at the same
time, but a single writer will block any other writers and any other
readers. The implication is that if the proportion of your write traffic
is high relative to your read traffic, you will waste a lot of time
waiting to acquire locks. So if you need high-throughput writes from
Whistlepig, you probably don’t want to glom it all together into a
single index, and instead should shard documents across multiple
indexes. High-throughput reads with occasional writes should be fine.
The benefit that locking does provide is that it frees me up to
implement more sophisticated data structures for postings lists. Right
now Whistlepig essentially builds giant linked lists of all postings,
which is very naive and wasteful, but does lend itself to a lock-free
approach. If I am guaranteed exclusive write access, I can be much
smarter about how I represent the postings. The learning in traditional
IR applications is that smaller postings lists can improve search time
as well. I have some concrete ideas on how to do this that should find
themselves into future Whistlepig releases.
I’ve released version 0.9 of Whistlepig, my minimalist real-time full-text
search engine. This is a beta release. There
are no known bugs, but I wouldn’t use it to power the space shuttle just yet.
Changes since the previous release include:
- A new
Query#term_map method, which allows you to programmatically alter
queries after they’ve been parsed. The most obvious use case for this is
to do case-folding, which can be tricky to do on the unparsed string due
to the “OR” operator.
- Various bugfixes and some additional safety-check code.
Do a gem install whistlepig to get it, or download the tarball here.
leveldb-ruby 0.7 has been released.
This release fixed a double-freeing bug in DB#close, which caused a
segfault if the GC tried to reclaim the LevelDB object or when Ruby
terminated.
Found this message in my own commit logs for the console rubygem and I figured I’d post it here in case some other poor fool
is stuck writing code with mbrtowc:
Well this is a fun little feature of mbrtowc that I discovered! Give it one
bad input and it will barf on all future inputs for the rest of the program,
unless you pass in your own cleared shift state object.
The mbrtowc manpage is helpfully vague:
“If the multibyte string starting at s contains an invalid multibyte
sequence before the next complete character, mbrtowc() returns (size_t) -1 and
sets errno to EILSEQ. In this case, the effects on *ps are undefined.”
and
“If ps is a NULL pointer, a static anonymous state only known to the
mbrtowc function is used instead. Otherwise, *ps must be a valid mbstate_t
object.”
The reader is left to infer, of course, that the “undefined effects” of the
“static anonymous state only known to the mbrtowc function” are that of
“breaking all successive calls for the rest of the execution of the program”.
Edsger Dijkstra explains why array indexes should start at 0 rather than 1. [pdf]
Here’s the summary: first, let’s consider question of how to represent a range
of natural numbers. For example, consider the range of integers . There are four compact ways to represent this by varying whether we
use or :
Which is best? In the last two variants, we can subtract the first number from
the last to find the size of the range. Based on that favorable property,
let’s restrict our options to those two.
What if we instead wish to represent ? Our two candidates
give us:
The second representation forces us into the realm of unnatural numbers, which
is not ideal. So we have our winning representation: .
Let’s return to the original problem of indexing things. We have two
choices for representing the entire range, depending on whether we start at
or :
Indexing starting at requires a clumsy ; starting at allows
us to write the number of elements directly as part of the range. We conclude
that indexing starting at 0 allows us the greatest notational convenience.
Since this is something I’ve had to figure out 5 times already:
git filter-branch --env-filter '
if [ "$GIT_AUTHOR_EMAIL" = "old-email-address" ];
then
export GIT_AUTHOR_EMAIL="new-email-address";
fi
' HEAD
… will replace all occurrences of ‘old-email-address’ in the git commit email
field with ‘new-email-address’ on the current branch.
The usual caveats of rewriting history apply: your head will diverge and anyone
working off your branch is in for a hassle.
Thanks to some prompting from Aaron Gallagher, I’ve bundled up two years of
Whisper bugfixes into release 0.6.
This release requires Ruby 1.9. I have also published the gem on rubygems.org.
Unfortunately the name “whisper” was already taken by some ancient Rails gem,
so to get this latest release, please do:
gem install whisperblog
Send your comments and questions my way. It’s still a pretty homebrew project,
obviously.
Whistlepig 0.2 is out already. This time it should actually compile under OS X.
I’m having a grand old time figuring out all the differences in flags that Ruby
when compiling gems under different architectures. But not so grand that I want
to learn automake/autoconf.
As part of making sure things were compiling on different platforms, I ran
make test-integration and noticed that on my dual-boot laptop runs it at
8000k/s on Linux but only 6200k/s on OS X. I was expecting a difference in
Linux’s favor, but not quite that large! As another point of reference, my
mid-range Linux desktop gives me 9500 k/s.
Today I released the very first version of
Whistlepig, a minimalist realtime full-text
search index.
Side projects apparently take a lot longer when you have a job and a baby,
because it’s taken me over 6 months to get to the point where I have something
releasable. And there are so many obvious improvements to make. But all known
bugs are squashed, and it’s good enough to use, so, it’s out.
The README has a
good description of what Whistlepig is, so here I thought I’d talk about the
why. Why write yet another inverted index?
The unfortunate fact is that you have too many choices already: Lucene,
obviously, and its derivatives like SOLR, and if you’re shy of the JVM, Xapian
and Sphinx. Ferret used to be a good choice in the Ruby world until Dave
Balmain absconded and no one had the cojones to maintain his code. I’ve used
each of these things.
But they are all very heavy-weight solutions, and they all suffer from what I
call the “TREC mentality”. In early TREC competitions, you were given a big,
static corpus, which you indexed at your leisure, and then you were given
a bunch of queries, which were all long descriptions of what documents someone
was interested in. It would be something like “I am interested in documents
about Mayan architecture, but only during the pre-conquistador period, and
specifically I am not interested in such and such” and so on. These
competitions were great in that they spurred advances in search engineering,
but the result is that almost every inverted index implementation today is
optimized for precisely the case of static corpora and large queries.
In the intervening 30 years, the use case for full-text search has far exceeded
the library-science-style applications of the early TREC competitions. There
are many applications where you don’t need tf-idf scores and the Okapi formula
or even necessarily stemming. You just want recent things that match your
query, and you value control and transparency over some kind of fuzzy natural
language matching. Search in GMail (or Sup of, course!) comes to mind, or
searching within the posts on this blog.
That’s one part of the reason why existing solutions are not ideal. The other
part is that inverted indexes are so optimized for speed and for size that even
little things like wanting documents from last to first can be drastically
slower than using the standard ordering. For example, Sup wants documents in
reverse chronological order; Xapian is fastest in increasing docid order; so we
play crazy games to map dates to docids:
DOCID_SCALE = 2.0**32
TIME_SCALE = 2.0**27
MIDDLE_DATE = Time.gm(2011)
def assign_docid m, truncated_date
t = (truncated_date.to_i - MIDDLE_DATE.to_i).to_f
docid = (DOCID_SCALE - DOCID_SCALE /
(Math::E**(-(t/TIME_SCALE)) + 1)).to_i
while docid > 0 and docid_exists? docid
docid -= 1
end
docid > 0 ? docid : nil
end
This snippet is courtesy of Rich Lane, who should be credited in history books
as the first person to find a use for a logistic curve in an email client.
If you try and use something like Xapian or Sphinx for these applications, you
have to play games like that for performance. And when new documents arrive,
you have to play further games to get them into the index sooner rather than
later. And all the while you’re turning off 90% of the features anyways.
So that leads us to the world of realtime search, which explicitly values
recent documents over older ones. It’s “realtime” where new documents arrive on
the fly and must be made available to queries as soon as they arrive. If you’re
in that situation, you typically also care more about more recent documents
that older ones anyways. Those are the two tenets of realtime search: documents
are available immediately, and recent documents are more important than older
ones.
Whistlepig is my attempt to capture those two tenets in as few lines of
code as possible, while still being reasonable performant. I do this by
stripping away all the vestigial TREC functionality of relevance, ranking,
sorting, tf-idf, etc. You get documents in LIFO order, and that’s it.
Whistlepig doesn’t return anything besides the docid either: if you need
something more than the id, you have to fit that into a separate store
somewhere. It turns out if you throw that stuff away, you can accomplish the
rest of the search problem without a tremendous amount of code. Like any C
program, it’s 5% algorithm and 95% bookkeeping.
There is one wrinkle that I actually add to the model: I allow adding and
removing labels from documents. Every other aspect of a document is fixed in
Whistlepig—you can’t even delete it from the index once it’s been added—but
labels are mutable. And of course you can intermingle labels with the other
components of your query. Almost every realtime search application I can dream
up would benefit from this functionality, so there you go.
My hope for Whistlepig is that it becomes the default choice for realtime
search applications, especially in the Ruby world, which hasn’t had a good
in-process search solution since Ferret bit the dust. And if I mysteriously
disappear like Dave did, I also hope that the codebase is small enough and
simple enough that taking it over doesn’t seem like a herculean effort.