ReadDB is client-server software for storing and accessing mapped
short reads.
Files
Running ReadDB
The jar file contains
both the server code and the client code. The following instructions
describe how to run the server and the client.
Because the jar file contains several main classes (one for the server
and one for each command line client), you cannot run it
with java -jar. Instead, add the jar file to your
classpath or specify it on the command line with java -cp
readdb.jar. You should also put the Picard SamTools jar file
in your classpath, since you'll need it to convert from SAM/BAM to
ReadDB's format.
If you've downloaded both files, you might add them to your classpath by running
export CLASSPATH=${HOME}/readdb-5-25-11.jar:${HOME}/sam-1.38.jar
(assuming you put the files in your home directory and used the default names).
Server Setup
First, create a directory for ReadDB that contains the following files:
users.txt
This file describes the usernames and passwords that can access the
server. There is one line per user, in the format
username:password
The users.txt file is re-read each time a user attempts to authenticate,
so you can modify it while the server is running.
groups.txt
ReadDB allows you to create groups of users and assign rights to those
groups rather than the individuals. This file contains one line per
group, in the format
groupname: userone usertwo userthree
The special group "admin" is for users who can shut down the server,
add users, and add users to groups.
The groups file is read on startup, but there are API functions to add
users to groups.
defaultACL.txt
This is the default, initial ACL for new alignments. It looks like
read: userone usertwo public
write: userone usertwo
admin: userone usertwo
This file is re-read whenever an alignment is created and is used to
seed the ACL for the new alignment. admin is the set of users and
groups who can change the ACL for the alignment.
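As a minimal sketch of the setup described above, the following script creates a data directory containing the three files. The directory name, user names, passwords, and the "labmembers" group are placeholders chosen for illustration; only the file names and line formats come from the descriptions above.

```shell
# Sketch: create a fresh ReadDB data directory with the three required files.
# WARNING: this overwrites any existing files of the same name in $datadir.
# All user names, passwords, and the "labmembers" group are placeholders.
datadir="datadir"
mkdir -p "$datadir"

# users.txt: one username:password line per user
cat > "$datadir/users.txt" <<'EOF'
userone:useronepassword
usertwo:usertwopassword
EOF

# groups.txt: one "groupname: member member ..." line per group.
# "admin" is the special group described above.
cat > "$datadir/groups.txt" <<'EOF'
admin: userone
labmembers: userone usertwo
EOF

# defaultACL.txt: the ACL used to seed each new alignment
cat > "$datadir/defaultACL.txt" <<'EOF'
read: userone usertwo public
write: userone usertwo
admin: userone
EOF

# users.txt holds credentials, so limit who can read it
chmod 600 "$datadir/users.txt"
```

The resulting directory is what you would pass to the server with -d.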
Starting the Server
Once users.txt, groups.txt, and defaultACL.txt are in place (e.g., in
"datadir"), you can start the server with
java -Xmx1G edu.mit.csail.cgs.projects.readdb.Server -M 400 -d datadir -p 52000 -C 400
- -p is the port number to listen on
- -M is the number of connections to allow
- -C is the number of files to cache
- -t is the number of threads to spawn
- -d is the directory with users.txt, groups.txt, and defaultACL.txt. One directory per alignment will be created here.
The server logs to STDERR.
I haven't done extensive testing to correlate the Java heap size and the
number of cached files. 3GB seems adequate for our usage and 400
files. Don't be too alarmed if you see high memory usage with top or
other tools: ReadDB uses mmap to access data files, so the full file size
is included in the process's virtual size even if the data isn't in
RAM.
Client Setup
The client software will look for a ~/.readdb_passwd file that looks
like
username=userone
passwd=useronepassword
hostname=readdb.csail.mit.edu
port=52000
Be sure to change the hostname and port to those that you're using for your Server.
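One way to create this file is sketched below. The function name write_readdb_passwd is hypothetical (it is not part of ReadDB), and the values written are the example values shown above, which you would replace with your own. The function refuses to overwrite an existing file and restricts permissions because the file holds a password.

```shell
# Sketch: write a client credentials file in the key=value format shown above.
# write_readdb_passwd is a hypothetical helper, not part of ReadDB.
# With no argument it targets ~/.readdb_passwd, where the client looks.
write_readdb_passwd() {
    target="${1:-$HOME/.readdb_passwd}"
    if [ -e "$target" ]; then
        echo "$target already exists, leaving it untouched" >&2
        return 0
    fi
    # create the file readable and writable by the owner only
    touch "$target" && chmod 600 "$target" || return 1
    # example values: replace with your own user, password, host, and port
    cat > "$target" <<'EOF'
username=userone
passwd=useronepassword
hostname=readdb.csail.mit.edu
port=52000
EOF
}
```

Calling write_readdb_passwd with no argument writes ~/.readdb_passwd; passing a path writes the same content elsewhere, which is only useful for inspecting the output.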
Loading Data
You can load data by passing it on STDIN to
java edu.mit.csail.cgs.projects.readdb.ImportHits --align alignmentname
Lines for unpaired reads must be tab delimited with the following
fields:
- chromosome (integer)
- position
- strand
- length
- weight
Lines for paired reads must be tab delimited with the following fields:
- chromosome for left read
- pos for left read
- strand for left read
- length for left read
- chromosome for right read
- pos for right read
- strand for right read
- length for right read
- weight (applies to the whole pair)
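Before importing, it can help to check that a file matches the unpaired layout. The sketch below is a hypothetical helper (check_unpaired is not part of ReadDB) that verifies each line has five tab-delimited fields and that chromosome, position, and length are non-negative integers. It assumes position and length are integers and does not check strand or weight, since their exact encoding is not described here.

```shell
# Sketch: sanity-check unpaired input lines before piping them to ImportHits.
# check_unpaired is a hypothetical helper, not part of ReadDB.
# Assumption: position and length are non-negative integers.
# Strand and weight are not validated.
check_unpaired() {
    awk -F'\t' '
        # flag lines with the wrong field count or non-integer
        # chromosome ($1), position ($2), or length ($4)
        NF != 5 || $1 !~ /^[0-9]+$/ || $2 !~ /^[0-9]+$/ || $4 !~ /^[0-9]+$/ {
            printf "line %d: malformed: %s\n", NR, $0 > "/dev/stderr"
            bad++
        }
        # exit status 1 if any line was flagged, 0 otherwise
        END { exit (bad > 0) }
    '
}
```

The function reads lines on STDIN, reports malformed lines on STDERR, and returns a non-zero status if any were found, so it can gate an import in a script.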
edu.mit.csail.cgs.projects.readdb.SAMToReadDB will convert from SAM/BAM
format to ReadDB, except that it does not convert string chromosome
identifiers to the numeric identifiers needed by ReadDB. Feel free to
modify it to suit your local convention for non-numeric chromosomes.
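If you would rather not modify the converter, a filter between SAMToReadDB and ImportHits can do the renaming. The sketch below is one possible approach for unpaired output only: it strips a leading "chr" from the first field, maps X, Y, and M to 23, 24, and 25, and drops any line whose chromosome is still not numeric. The function name map_chroms and the numbers chosen for X, Y, and M are purely hypothetical, so substitute your own convention.

```shell
# Sketch: convert string chromosome names in the first field to integers.
# map_chroms is a hypothetical helper, not part of ReadDB.
# The X/Y/M numbering below is an arbitrary example convention.
# Only the first field is rewritten, so this handles unpaired lines only.
map_chroms() {
    awk 'BEGIN {
             FS = OFS = "\t"
             map["X"] = 23; map["Y"] = 24; map["M"] = 25
         }
         {
             sub(/^chr/, "", $1)               # chr7 -> 7, chrX -> X
             if ($1 in map) $1 = map[$1]       # X -> 23, and so on
             if ($1 ~ /^[0-9]+$/) print        # drop anything still non-numeric
         }'
}
```

Unlike a filter that keeps only numeric chromosomes, this keeps reads on the sex and mitochondrial chromosomes, at the cost of having to remember the mapping when querying.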
For a simple test, I used
java -cp /tmp/readdb.jar edu.mit.csail.cgs.projects.readdb.SAMToReadDB < small.against_hg19.bam | egrep '^chr[[:digit:]]+[[:space:]]' | sed -e 's/^chr//' | java -cp /tmp/readdb.jar edu.mit.csail.cgs.projects.readdb.ImportHits --align sample
And then some tests to make sure it loaded:
java edu.mit.csail.cgs.projects.readdb.ReadDB getchroms sample
java edu.mit.csail.cgs.projects.readdb.ReadDB getcount sample 3
echo "1:0-10000000" | java edu.mit.csail.cgs.projects.readdb.Query --align sample
Command Line Queries
edu.mit.csail.cgs.projects.readdb.Query is the basic query class. It
reads regions (e.g., "4:1000000-2000000" or "15:0-10000:+") on STDIN and
produces output on STDOUT with either aligned read information or a
histogram.
- --align specifies the alignment to query
- --histogram 40 says to produce a histogram with 40bp bins
- --weights says to include alignment weights in the output
- --paired says to query paired reads rather than single-ended reads
- --noheader says not to print the queried region on STDOUT
- --bed prints individual hits in BED format
- --wiggle prints histogram output in wiggle format
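Since Query reads regions on STDIN, a small generator makes it easy to query a chromosome in fixed-size windows. The sketch below is a hypothetical helper (make_regions is not part of ReadDB) that prints regions in the chromosome:start-end form described above. Adjacent windows share a boundary coordinate; whether region ends are inclusive is not stated here, so adjust the arithmetic if that matters for your counts.

```shell
# Sketch: print chromosome:start-end regions that tile [start, end).
# make_regions is a hypothetical helper, not part of ReadDB.
# Usage: make_regions CHROM START END STEP
make_regions() {
    awk -v chrom="$1" -v start="$2" -v end="$3" -v step="$4" 'BEGIN {
        if (step <= 0) exit 1           # avoid an endless loop
        for (s = start; s < end; s += step) {
            e = s + step
            if (e > end) e = end        # clip the final window
            printf "%s:%d-%d\n", chrom, s, e
        }
    }'
}
```

For example, make_regions 1 0 10000000 1000000 emits ten regions, which could then be piped to java edu.mit.csail.cgs.projects.readdb.Query --align sample --histogram 40.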
edu.mit.csail.cgs.projects.readdb.ReadDB provides
additional query functionality. The first argument is the command,
followed by any additional arguments. Commands are
exists alignmentname
getchroms alignmentname
getacl alignmentname
setacl alignmentname user add write
setacl alignmentname user delete read
getcount alignmentname (gets the number of hits in the alignment)
getcount alignmentname chromosome
addtogroup username groupname
Java and Perl API
Client.java and ReadDBClient.pm implement
Java and Perl interfaces to ReadDB. Client.java contains
the documentation and is the "official"
version. ReadDBClient.pm mimics the Java version (and
doesn't contain method documentation) and receives less use and
testing.
Contact the authors if you're interested in using ReadDB with GBrowse or
the UCSC genome browser. Some work has been done for the former and the
latter would definitely be of interest.
Documentation
The javadocs are in the jar file and are also available online.
Client.java
contains the API docs for the java client; ImportHits.java,
ReadDB.java, and Query.java are the command line clients.