<div>Thank you, Thibaut.</div><div><br></div>I have 75k scaffolds because I didn't discard any contigs from the ABySS-SSPACE pipeline. I now realize that this is a problem.<div>Do you mean that zebrafish is a better projection reference for a fish genome? I'll try it.</div>
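<div>For dropping the short scaffolds, something like the sketch below might be enough. It assumes a plain FASTA input and a 1000 bp cutoff; the function names <code>read_fasta</code> and <code>filter_by_length</code> and the demo records are made up for illustration and are not part of ABySS, SSPACE or Ensembl.</div>

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_by_length(records, min_len=1000):
    """Keep only records whose sequence is at least min_len bp long."""
    return [(h, s) for h, s in records if len(s) >= min_len]


# Hypothetical demo records: 1200 bp, 240 bp and exactly 1000 bp.
demo = [">scaf1", "ACGT" * 300, ">scaf2", "ACGT" * 60, ">scaf3", "A" * 1000]
kept = filter_by_length(read_fasta(demo), min_len=1000)
print([h for h, _ in kept])
```

<div>On a real assembly the same functions could be fed an open file handle instead of the demo list; the cutoff is a judgment call and 1000 bp is only the value discussed in this thread.</div>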

<div><br></div><div>Stats of my genome:</div><div><br></div><div><div><div>All scaffolds:</div><pre>n:200   n:N50  min   median  mean    N50      max      sum
75084   164    200   256     7573    1010578  5780275  568.6e6</pre>

</div></div><div><br></div><div>Scaffolds longer than 1000 bp:</div><div><div><pre>n:1000  n:N50  min   median  mean    N50      max      sum
2602    154    1000  16652   210525  1052979  5780275  547.7e6</pre>
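<div>To illustrate why the N50 barely changes between the two sets of stats: by the sums above, the scaffolds shorter than 1000 bp add only about 20.9e6 bp (568.6e6 minus 547.7e6), so they hardly shift the half-way point of the assembly. Below is a rough sketch of an N50 calculation on toy lengths. It uses the common definition of N50 and may differ in detail from what the ABySS stats tool does; the function name <code>n50</code> and the toy numbers are only for illustration.</div>

```python
def n50(lengths):
    """Return the length L such that scaffolds of length L or more
    together cover at least half of the total assembly size."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    raise ValueError("no lengths given")


# Toy assembly: a few large scaffolds plus many tiny ones.
big = [1_000_000] * 10
tiny = [250] * 1000

print(len(big), n50(big))                # 10 scaffolds
print(len(big + tiny), n50(big + tiny))  # 1010 scaffolds, same N50
```

<div>In the toy case the scaffold count goes from 10 to 1010 while the N50 stays at 1,000,000, which is the same pattern as 2602 versus 75084 scaffolds here.</div>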

</div><div><div><br></div></div><div>I think these numbers are OK.</div><div><br></div><div>Another question: should I run RepeatMasker on my core database before running the 2X alignment pipeline?</div><div><br></div><div><br></div><div>

<div class="gmail_quote">On Tue, Nov 1, 2011 at 7:48 PM, Thibaut Hourlier <span dir="ltr">&lt;<a href="mailto:th3@sanger.ac.uk">th3@sanger.ac.uk</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">

Hi,<br>

Can you check the numbers you have for the assembly? It seems a<br>

little bit odd that you have an N50 around 1M and only 2600 scaffolds longer<br>

than 1000bp while you have around 75k scaffolds in total.<br>

There are different things you can do:<br>

 - you can use the zebrafish assembly. It might be far away in the<br>

taxonomy, but the assembly and the annotation are more comprehensive: the<br>

annotation includes models from RNA-Seq data and manual annotations.<br>

 - as your scaffolds are small they shouldn't take a long time to run so<br>

you can use a high batch_size. Just test it with a few sequences: see how<br>

long it takes to run the analysis with 1000 scaffolds for example and<br>

adjust the batch size accordingly. In our environment we try to have jobs<br>

running for 1h when we have a lot of jobs.<br>
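<br>
A rough sketch of that adjustment, assuming run time scales roughly linearly with the number of scaffolds. The function name <code>estimate_batch_size</code> and the timings are hypothetical and only illustrate the arithmetic; they are not part of the pipeline configuration.<br>

```python
def estimate_batch_size(test_scaffolds, test_seconds, target_seconds=3600):
    """Scale a timed test run to a batch that should take about
    target_seconds, assuming time grows linearly with scaffold count."""
    if test_scaffolds <= 0 or test_seconds <= 0:
        raise ValueError("test run must have positive size and duration")
    # Integer arithmetic avoids floating-point rounding surprises.
    return max(1, (target_seconds * test_scaffolds) // test_seconds)


# Hypothetical timing: 1000 scaffolds took 10 minutes in the test run.
print(estimate_batch_size(test_scaffolds=1000, test_seconds=600))
```

With those made-up numbers a job of about 1h would hold around 6000 scaffolds; real timings will depend on scaffold length and on the cluster.<br>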

<br>

You should use the 2X alignment documentation for this step.<br>

<br>

Also, if your assembly is too fragmented, the pipeline will take a really<br>

long time for a result which probably won't be good. In that case it might<br>

be better to wait until you have more data.<br>

<br>

Regards<br>

Thibaut<br>

<br>

<br>

On Mon, 31 Oct 2011 14:29:43 +0800, Zhang Di &lt;<a href="mailto:aureliano.jz@gmail.com">aureliano.jz@gmail.com</a>&gt;<br>

wrote:<br>

<div><div></div><div class="h5">> Hi,<br>

>   As described previously, I'm trying to run the low coverage annotation<br>

> pipeline for our Illumina GAII sequenced fish genome (~800m).<br>

>   The doc low_coverage_gene_build.txt tells me to prepare my own compara<br>

> db, so I went to ensembl-compara.<br>

>   For my fish genome, I have ~75k scaffolds (length >= 200bp, N50 ~1M),<br>

> among which 2600 scaffolds are longer than 1000bp. my ref genome is<br>

> stickleback, and I followed the README-pairaligner doc.<br>

>   As the ref genome has ~2000 chunks (size 1M), there will be 2000 x 75000<br>

> = 150M pairaligner jobs, too many to run at my institute.<br>

>   here are my questions:<br>

>   1. should I only use these scaffolds longer than 1000bp?<br>

>   2. am I following the right doc? Which doc should I read to produce an<br>

> alignment such that 'each bp in the target genome should be represented<br>

>   at most once' (cited from low_coverage_gene_build.txt)? I don't quite<br>

> understand README-2xalignment and README-low-coverage-genome-aligner.<br>

><br>

> Thank you<br>

><br>

> Best regards

<br>

<br>

</div></div>

</blockquote></div><br><br clear="all"><div><br></div>-- <br>Zhang Di<br>

</div></div>