
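Getting the GroundHog tutorial running

This post covers installing GroundHog, the RNN implementation that lisa-lab (the group behind pylearn2) develops on top of Theano, and running its tutorial.

It was fairly painful, so I am writing it down...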

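Environment

install

  • Install the dependencies with pip
    • Install python-dev, blas, and lapack with apt-get first, then install Theano
    • Most dependencies get pulled in while installing Theano; I think I additionally installed Cython and numexpr when installing tables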
$ pip list
apt-xapian-index (0.45)
argparse (1.2.1)
chardet (2.0.1)
colorama (0.2.5)
configobj (4.7.2)
Cython (0.22)
groundhog (0.1dev, /home/laughing/GroundHog)
html5lib (0.999)
numexpr (2.4)
numpy (1.9.2)
PAM (0.4.2)
pip (1.5.4)
pyOpenSSL (0.13)
pyserial (2.6)
python-apt (0.9.3.5ubuntu1)
python-debian (0.1.21-nmu2ubuntu2)
requests (2.2.1)
scipy (0.15.1)
setuptools (3.3)
six (1.5.2)
ssh-import-id (3.21)
tables (3.1.1)
Theano (0.6.0)
Twisted-Core (13.2.0)
urllib3 (1.7.1)
wsgiref (0.1.2)
zope.interface (4.0.5)
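
Before moving on, it can be worth confirming that the packages above actually import. The following is only a sketch: the import names in REQUIRED are my assumption of what each package in the pip list is imported as, and check_modules is a hypothetical helper, not something GroundHog provides.

```python
import importlib


def check_modules(names):
    """Return {module_name: True/False} depending on whether each import succeeds."""
    status = {}
    for name in names:
        try:
            importlib.import_module(name)
            status[name] = True
        except ImportError:
            status[name] = False
    return status


# Import names assumed from the pip list above (e.g. `tables` for PyTables).
REQUIRED = ["numpy", "scipy", "theano", "tables", "numexpr", "Cython", "groundhog"]

# Example call (kept as a comment so nothing heavy is imported by default):
# for name, ok in sorted(check_modules(REQUIRED).items()):
#     print("%-10s %s" % (name, "ok" if ok else "MISSING"))
```

Anything reported as missing would need to be installed before the later steps.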

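Preparing and processing the dataset

Apparently something like enwik8 is normally used, but it is large and processing would likely be slow, so I use alice29.txt instead

Three datasets (train, valid, test) are apparently required, so I cut corners and point three symbolic links at the same file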

$ ln -s alice29.txt train
$ ln -s alice29.txt valid
$ ln -s alice29.txt test
$ ll
total 152K
-r-------- 1 laughing laughing 149K Mar  8 18:21 alice29.txt
lrwxrwxrwx 1 laughing laughing   11 Mar  8 18:22 test -> alice29.txt
lrwxrwxrwx 1 laughing laughing   11 Mar  8 18:22 train -> alice29.txt
lrwxrwxrwx 1 laughing laughing   11 Mar  8 18:22 valid -> alice29.txt
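
The same three links can also be created from Python, which is handy if the data directory is rebuilt often. A minimal sketch, assuming a POSIX system; link_splits is a hypothetical helper name and the split names simply mirror the listing above.

```python
import os


def link_splits(source, dest_dir, names=("train", "valid", "test")):
    """Create one symbolic link per split name in dest_dir, all pointing at source."""
    created = []
    for name in names:
        link_path = os.path.join(dest_dir, name)
        if os.path.islink(link_path):
            os.remove(link_path)  # replace a stale link left by an earlier run
        # A relative source is resolved relative to dest_dir, like `ln -s`.
        os.symlink(source, link_path)
        created.append(link_path)
    return created
```

For the layout in this post the call would be something like link_splits("alice29.txt", os.path.expanduser("~/alice")). A regular file that already has one of the split names is left alone and makes os.symlink raise instead of being overwritten.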

Create the data in npz format

$ python generate.py --dest=data_chars --level=chars ~/alice
Constructing the vocabulary ..
 .. sorting words
 .. shrinking the vocabulary size
EOL 0
Constructing train set
Constructing valid set
Constructing test set
Saving data
... Done

Running the tutorial

After changing the paths and settings, run DT_RNN_Tut.py in CPU mode

A comment on an issue says n_in and n_out should be changed to match the data, so I changed them
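
To get a feel for how large n_in and n_out need to be for character-level data, one option is to count the distinct characters in the corpus. This is a rough sketch under the assumption that the character vocabulary is close to the set of distinct bytes in the file; generate.py may add special symbols of its own (the output above shows an EOL entry), so I treat the result as a lower bound. count_distinct_chars is a hypothetical helper.

```python
def count_distinct_chars(path):
    """Count the distinct byte values in a file (a proxy for character vocabulary size)."""
    with open(path, "rb") as f:
        return len(set(f.read()))
```

Calling count_distinct_chars("alice29.txt") gives a number to compare against the chosen n_in and n_out; presumably they should be no smaller than the vocabulary size.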

Also, when a sample is printed after a certain number of training iterations, an IndexError can occur depending on the data, so I work around it by guarding that print

$ git diff
diff --git a/tutorials/DT_RNN_Tut.py b/tutorials/DT_RNN_Tut.py
index e6e83d8..eb46ed0 100644
--- a/tutorials/DT_RNN_Tut.py
+++ b/tutorials/DT_RNN_Tut.py
@@ -94,8 +94,8 @@ def jobman(state, channel):
         state['n_in'] = 10000
         state['n_out'] = 10000
     else:
-        state['n_in'] = 50
-        state['n_out'] = 50
+        state['n_in'] = 100
+        state['n_out'] = 100
     train_data, valid_data, test_data = get_text_data(state)
 
     ## BEGIN Tutorial
@@ -254,7 +254,8 @@ def jobman(state, channel):
         sample = sample_fn()[0]
         print 'Sample:',
         if state['chunks'] == 'chars':
-            print "".join(dictionary[sample])
+            if len(sample) < len(dictionary):
+                print "".join(dictionary[sample])
         else:
             for si in sample:
                 print dictionary[si],
@@ -298,8 +299,8 @@ if __name__=='__main__':
     state = {}
     # complete path to data (cluster specific)
     state['seqlen'] = 100
-    state['path']= "/data/lisa/data/PennTreebankCorpus/pentree_char_and_word.npz"
-    state['dictionary']= "/data/lisa/data/PennTreebankCorpus/dictionaries.npz"
+    state['path']= "data_chars.npz" #"/data/lisa/data/PennTreebankCorpus/pentree_char_and_word.npz"
+    state['dictionary']= "data_chars_dict.npz" #"/data/lisa/data/PennTreebankCorpus/dictionaries.npz"
     state['chunks'] = 'chars'
     state['seed'] = 123
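
The guard in the diff compares lengths, which is enough to get past the error here but does not look at the index values themselves. An alternative would be to check that every sampled index is inside the dictionary. A minimal sketch, assuming sample is a sequence of integer indices and dictionary a sequence of characters (plain Python lists here, not whatever array types the tutorial uses); decode_sample is a hypothetical helper.

```python
def decode_sample(sample, dictionary):
    """Map integer indices to characters, or return None if any index is out of range."""
    size = len(dictionary)
    # Negative indices are treated as invalid too, instead of wrapping around.
    if any(i < 0 or i >= size for i in sample):
        return None
    return "".join(dictionary[i] for i in sample)
```

With something like this, the sample would be printed only when decode_sample returns a string.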

Run

$  THEANO_FLAGS=device=cpu,floatX=float32 python DT_RNN_Tut.py
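
The same flags can be set from inside a Python script instead of on the command line, as long as that happens before Theano is first imported, since Theano reads THEANO_FLAGS at import time. A small sketch; set_theano_flags is a hypothetical helper.

```python
import os


def set_theano_flags(flags):
    """Join (key, value) pairs into the comma-separated THEANO_FLAGS format and export it."""
    value = ",".join("%s=%s" % (key, val) for key, val in flags)
    os.environ["THEANO_FLAGS"] = value
    return value


# Must run before the first `import theano` in the process.
set_theano_flags([("device", "cpu"), ("floatX", "float32")])
```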

It works

THEANO_FLAGS=device=cpu,floatX=float32 python DT_RNN_Tut.py
data length is  152089
data length is  152089
data length is  152089
/usr/local/lib/python2.7/dist-packages/theano/tensor/subtensor.py:110: FutureWarning: comparison to `None` will result in an elementwise object comparison in the future.
  start in [None, 0] or
/usr/local/lib/python2.7/dist-packages/theano/tensor/subtensor.py:114: FutureWarning: comparison to `None` will result in an elementwise object comparison in the future.
  stop in [None, length, maxsize] or
/usr/local/lib/python2.7/dist-packages/theano/tensor/subtensor.py:190: FutureWarning: comparison to `None` will result in an elementwise object comparison in the future.
  if stop in [None, maxsize]:
/usr/local/lib/python2.7/dist-packages/theano/tensor/opt.py:2165: FutureWarning: comparison to `None` will result in an elementwise object comparison in the future.
  if (replace_x == replace_y and
/usr/local/lib/python2.7/dist-packages/theano/scan_module/scan_perform_ext.py:85: RuntimeWarning: numpy.ndarray size changed, may indicate binary incompatibility
  from scan_perform.scan_perform import *
/usr/local/lib/python2.7/dist-packages/theano/sandbox/rng_mrg.py:768: UserWarning: MRG_RandomStreams Can't determine #streams from size (Elemwise{Cast{int32}}.0), guessing 60*256
  nstreams = self.n_streams(size)
Constructing grad function
Compiling grad function
took 25.5895009041
Validation computed every 1000
Saving the model...
Model saved, took 0.990880966187
.. iter    0 cost 4.328 grad_norm 3.86e+00 log2_p_word 6.24e+00 log2_p_expl 6.24e+02 step time  0.042 sec whole time 27.825 sec lr 1.00e+00
Sample: .. iter  100 cost 3.166 grad_norm 5.84e-01 log2_p_word 4.57e+00 log2_p_expl 4.57e+02 step time  0.162 sec whole time 48.088 sec lr 1.00e+00
.. iter  200 cost 3.211 grad_norm 5.07e-01 log2_p_word 4.63e+00 log2_p_expl 4.63e+02 step time  0.102 sec whole time 68.236 sec lr 1.00e+00
.. iter  300 cost 3.022 grad_norm 5.09e-01 log2_p_word 4.36e+00 log2_p_expl 4.36e+02 step time  0.213 sec whole time 88.404 sec lr 1.00e+00
.. iter  400 cost 2.765 grad_norm 3.39e-01 log2_p_word 3.99e+00 log2_p_expl 3.99e+02 step time  0.041 sec whole time 108.597 sec lr 1.00e+00
.. iter  500 cost 2.912 grad_norm 1.73e-01 log2_p_word 4.20e+00 log2_p_expl 4.20e+02 step time  0.043 sec whole time  2.147 min lr 1.00e+00
.. iter  600 cost 2.532 grad_norm 1.75e-01 log2_p_word 3.65e+00 log2_p_expl 3.65e+02 step time  0.219 sec whole time  2.480 min lr 1.00e+00
.. iter  700 cost 2.467 grad_norm 1.73e-01 log2_p_word 3.56e+00 log2_p_expl 3.56e+02 step time  0.163 sec whole time  2.814 min lr 1.00e+00
.. iter  800 cost 2.758 grad_norm 1.94e-01 log2_p_word 3.98e+00 log2_p_expl 3.98e+02 step time  0.383 sec whole time  3.156 min lr 1.00e+00
.. iter  900 cost 2.561 grad_norm 1.93e-01 log2_p_word 3.70e+00 log2_p_expl 3.70e+02 step time  0.181 sec whole time  3.487 min lr 1.00e+00
.. iter 1000 cost 2.318 grad_norm 1.94e-01 log2_p_word 3.34e+00 log2_p_expl 3.34e+02 step time  0.042 sec whole time  3.821 min lr 1.00e+00
**  0     validation: cost:3.674834  ppl:12.771302 whole time  5.045 min patience 1
>>>         Test cost: 3.675  ppl:12.771
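
Reading the numbers in this log, the columns seem to be related as follows: log2_p_word matches cost divided by ln 2 (so the per-step cost looks like nats and log2_p_word like bits), log2_p_expl matches log2_p_word times the sequence length of 100, and on the validation line ppl matches 2 raised to the cost. This is inferred from the printed values only, not from the GroundHog source. A short sketch to check the arithmetic:

```python
import math

SEQLEN = 100  # state['seqlen'] in DT_RNN_Tut.py


def nats_to_bits(cost_nats):
    """Convert a cost in nats to bits."""
    return cost_nats / math.log(2)


def perplexity_from_bits(cost_bits):
    """Perplexity corresponding to a per-symbol cost given in bits."""
    return 2.0 ** cost_bits


# iter 0 in the log: cost 4.328, log2_p_word 6.24e+00, log2_p_expl 6.24e+02
bits_per_char = nats_to_bits(4.328)
print(bits_per_char, bits_per_char * SEQLEN)

# validation line in the log: cost 3.674834, ppl 12.771302
print(perplexity_from_bits(3.674834))
```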

It had finished in about two days

No testing 1.9251015647 > 1.9251015647
testcost 1.9251015647
testtime 2881.70835375
testppl 3.79763582524
Saving the model...
Model saved, took 1.07689595222
Took 2883.40501855 min
Average step took 0.158221706748
That amounts to 546069.194776 sentences in a day
Average log2 per example is 198.125991821
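
The summary numbers are consistent with each other: 2883.4 minutes is almost exactly two days, the "sentences in a day" figure equals the number of seconds in a day divided by the average step time, and testppl equals 2 raised to testcost. A sketch of that arithmetic, using only values printed above (the helper names are hypothetical):

```python
SECONDS_PER_DAY = 24 * 60 * 60


def minutes_to_days(minutes):
    """Convert a duration in minutes to days."""
    return minutes / (24.0 * 60.0)


def steps_per_day(avg_step_seconds):
    """How many steps fit in one day at the given average step time."""
    return SECONDS_PER_DAY / avg_step_seconds


print(minutes_to_days(2883.40501855))   # total run time in days
print(steps_per_day(0.158221706748))    # compare with "sentences in a day"
print(2.0 ** 1.9251015647)              # compare with testppl
```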
