FC2ブログ

スポンサーサイト

上記の広告は1ヶ月以上更新のないブログに表示されています。
新しい記事を書く事で広告が消せます。

GO FOR IT の 暗号検索の高速化 に挑戦する

前回に引き続き GO FOR IT に挑戦します.挑戦する問題は第3問の,暗号検索の高速化です.

Keywords: Python, 暗号, 高速化


問題の概略を説明します.ある規則によって生成されるランダム文字列には,先頭 208728 文字目から 4162 文字スキップで "sony" の文字列が存在します.文字列の規則は以下の通りです.
r <- (r * 1103515245 + 12345) & 0xFFFFFFFF
rand_str[i] <- 0x61 + (26 * (r / 0x10000)) / 0x10000
i <- i + 1
この暗号文字列では最小スキップの文字列が重要な意味をもち,2 文字以上のキーワードがランダム文字列中のどこにあるかを最小スキップ順に高速に求めたいとします.求められる仕様として,結果は最小スキップ順にソートされていること(スキップ数が同じならば,暗号文字列の先頭に近いもの),キーワード文字列が逆順に一致した場合も出力することがあります.これを

(i) 実行速度にこだわらず,読みやすいコードを書く
(ii) アルゴリズムを工夫し,検索処理を高速化する

として解答します.


(i) のアルゴリズム
まずパターン文字列の最初に一致する位置を探し,見つかったら 2 番目の文字に一致する位置を探します.2 つの位置からスキップ数が決定するので,順番にスキップしながら文字の一致を調べていきます.最後の文字まで一致したら結果に格納し,再び最初の文字をランダム文字列中から探していきます.

(ii) のアルゴリズム
まずパターン文字列の最初と最後に一致する位置をすべて調べます.問題の条件より最初と最後の位置から決定される区間は特定のスキップ数で割りきれるはずなので,mod 演算して評価を続行するか調べます.評価を継続する場合は単純にパターン文字列と比較します.
このアルゴリズムでは,文字の位置の組み合わせ次第では最初の文字の位置が最後の文字より後になることが重要です.このケースでは暗号文字列の先頭方向にスキップし,パターンに一致した場合は逆順に一致したことになります.これにより (i) に比べて倍程度の効率化が見込めます.


プログラムは暗号検索のコア部分を Python の拡張モジュールとして実装しました.コアの処理は nosetail.c と ayatori.c に,利用部分は cipher_main.py に記述してあります.導入手順が複雑になるので package を配布します.

goforit.zip

<作業環境>
Windows7 Professional 64bit
Windows SDK v7.0
Python 2.6.6
numpy 1.6.1

<必須(と思われる)環境>
Python 2.6.x または 2.7.x
numpy 1.6 以上
msvc や gcc などプラットフォームに合わせた C のビルド環境

<実行方法>
まず拡張モジュールをインストールします.以下のコマンドを,setup.py のあるディレクトリに移動し実行してください.
python setup.py install
これで実行可能になるはずなので,cipher_main.pyを実行してください.
python cipher_main.py
キーワードを要求されるので,入力すると 2 つの手法で検索を開始し,それぞれに要した時間を表示します.検索結果は nosetail-"keyword".txt と ayatori-"keyword".txt のファイル名で出力されます.

私の環境では標準出力に次のように出力されました.改良型のほうが実行時間が約 1/3 程度になっています.
Enter pattern: sony
0:00:07.412000 nose to tail method
0:00:02.213000 ayatori method


最後にソースコードを示します.今回のソースコードは MIT ライセンスとします.nosetail.c と general.py の find_nosetail 部分を (i) への,ayatori.c と general.py のfind_ayatori を (ii) への解答とします.以下に Python 側から見える C 関数の対応を示します.

nosetail.find (general.py) -> nosetail_find (nosetail.c)
ayatori.find (general.py) -> ayatori_find (ayatori.c)


nosetail.c
/*
* Copyright (c) 2012 Akihiko Ishida, http://codeit.blog.fc2.com/
*
* Permission is hereby granted, free of charge, to any person obtaining
* a copy of this software and associated documentation files (the
* "Software"), to deal in the Software without restriction, including
* without limitation the rights to use, copy, modify, merge, publish,
* distribute, sublicense, and/or sell copies of the Software, and to
* permit persons to whom the Software is furnished to do so, subject to
* the following conditions:
*
* The above copyright notice and this permission notice shall be
* included in all copies or substantial portions of the Software.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
* EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
* MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
* NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
* LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
* OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
* WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
*/

#include <Python.h>

#include <stdio.h>
#include <stdlib.h>


static const size_t RESULTSIZE_INIT = 5000;
static const size_t RESULTSIZE_INCR = 1000;

static size_t *result = NULL;
static size_t result_index = 0;
static size_t result_count = 0;

static char *src = NULL;
static size_t src_len = 0;

static char *pattern = NULL;
static size_t pattern_len = 0;


static size_t get_result_size(void)
{
return sizeof(size_t) * result_count * 2;
}

static int resize_result_buff(void)
{
if(result_index >= result_count) {
result_count += RESULTSIZE_INCR;
result = (size_t*)realloc(result, get_result_size());

if(result == NULL) {
PyErr_NoMemory();
return -1;
}
}
return 0;
}

static int pattern_match(size_t begin, size_t skip)
{
size_t i, index;

for(i = 2; i < pattern_len; i++) {
index = begin + skip * i;

if(index >= src_len) {
return 0;
}
else if(pattern[i] != src[index]) {
return 0;
}
}

return 1;
}

static int find_2nd_character(size_t begin)
{
size_t i, index, skip;
int agreed = 0;

for(i = begin + 1; i < src_len; i++) {
if(src[i] == pattern[1]) {
// Get skip offset
skip = i - begin;

if(pattern_len > 2) {
agreed = pattern_match(begin, skip);
} else {
agreed = 1;
}

if(agreed > 0) {
// Store a pair of begin and skip
index = 2 * result_index;
result[index] = begin;
result[index + 1] = skip;
result_index++;

if(resize_result_buff() < 0) {
return -1;
}
}
}
}

return 0;
}

static int find_pattern(void)
{
size_t i;

// Memory allocation
result_count = RESULTSIZE_INIT;
result_index = 0;
result = (size_t*)malloc(get_result_size());

if(result == NULL) {
PyErr_NoMemory();
return -1;
}

for(i = 0; i < src_len; i++) {
// Find first character
if(src[i] == pattern[0]) {
if(find_2nd_character(i) < 0) return -1;
}
}

result_count = result_index;

return 0;
}

static PyObject* get_result_object(void)
{
PyObject *byte;

if(result == NULL) {
return NULL;
}

byte = PyByteArray_FromStringAndSize((char*)result, get_result_size());

free(result);
result = NULL;
result_index = 0;
result_count = 0;

return byte;
}


static PyObject* nosetail_find(PyObject *self, PyObject *args)
{
int err = PyArg_ParseTuple(args, "ss", &src, &pattern);
if(!err) {
return NULL;
}

src_len = strlen(src);
pattern_len = strlen(pattern);

if(find_pattern() < 0) {
return NULL;
}

return get_result_object();
}

static PyObject* nosetail_size_t(PyObject *self, PyObject *args)
{
return Py_BuildValue("I", sizeof(size_t));
}

static PyMethodDef nosetail_functions[] = {
{"find", nosetail_find, METH_VARARGS,
""},
{"size_t", nosetail_size_t, METH_NOARGS,
""},
{0}
};

#ifndef PyMODINIT_FUNC
#define PyMODINIT_FUNC void
#endif
PyMODINIT_FUNC initnosetail(void)
{
PyObject *m = Py_InitModule3(
"nosetail", nosetail_functions,
"Provide find pattern function based on nose to tail method."
);
}

ayatori.c
/*
* Copyright (c) 2012 Akihiko Ishida, http://codeit.blog.fc2.com/
*
* Permission is hereby granted, free of charge, to any person obtaining
* a copy of this software and associated documentation files (the
* "Software"), to deal in the Software without restriction, including
* without limitation the rights to use, copy, modify, merge, publish,
* distribute, sublicense, and/or sell copies of the Software, and to
* permit persons to whom the Software is furnished to do so, subject to
* the following conditions:
*
* The above copyright notice and this permission notice shall be
* included in all copies or substantial portions of the Software.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
* EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
* MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
* NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
* LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
* OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
* WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
*/

#include <Python.h>

#include <stdio.h>
#include <stdlib.h>


static const size_t RESULTSIZE_INIT = 5000;
static const size_t RESULTSIZE_INCR = 1000;

static const size_t INDEXSIZE_INIT = 10000;
static const size_t INDEXSIZE_INCR = 5000;


static char *src = NULL;
static size_t src_len = 0;

static char *pattern = NULL;
static size_t pattern_len = 0;

static size_t *idx0 = NULL;
static size_t idx0_count = 0;
static size_t idx0_index = 0;

static size_t *idx1 = NULL;
static size_t idx1_count = 0;
static size_t idx1_index = 0;

static size_t *ret0 = NULL;
static size_t ret0_count = 0;
static size_t ret0_index = 0;

static size_t *ret1 = NULL;
static size_t ret1_count = 0;
static size_t ret1_index = 0;


// buffer resizing
static size_t get_result_size(const size_t count)
{
return sizeof(size_t) * count * 2;
}

static size_t* resize_result_buff(size_t *buff, size_t count)
{
return realloc(buff, get_result_size(count));
}

static size_t* resize_idx_buff(size_t *buff, size_t count)
{
return realloc(buff, sizeof(size_t) * count);
}


// text pattern matching
static int pattern_match(size_t begin, size_t skip)
{
size_t i;
for(i = 1; i < pattern_len - 1; i++) {
if(pattern[i] != src[begin + skip * i]) return 0;
}
return 1;
}

static int pattern_match_reversed(size_t begin, size_t skip)
{
size_t i;
for(i = 1; i < pattern_len - 1; i++) {
if(pattern[i] != src[begin - skip * i]) return 0;
}
return 1;
}


// list all index where first or last character of pattern
static int find_indices(const char c0, const char c1)
{
size_t i;

for(i=0; i<src_len; i++) {
if(c0 == src[i]) {
idx0[idx0_index] = i;
idx0_index++;

if(idx0_index >= idx0_count) {
idx0_count += INDEXSIZE_INCR;
idx0 = resize_idx_buff(idx0, idx0_count);

if(idx0 == NULL) { // Not enough memory
PyErr_NoMemory();
return -1;
}
}
}
if(c1 == src[i]) {
idx1[idx1_index] = i;
idx1_index++;

if(idx1_index >= idx1_count) {
idx1_count += INDEXSIZE_INCR;
idx1 = resize_idx_buff(idx1, idx1_count);

if(idx1 == NULL) { // Not enough memory
PyErr_NoMemory();
return -1;
}
}
}
}

idx0_count = idx0_index;
idx1_count = idx1_index;

return 0;
}


// check possible combination
static int check_combination(size_t index0, size_t index1, size_t step)
{
size_t i, skip;

if(index1 > index0) {
skip = index1 - index0;

if((skip % step) == 0) {
skip /= step;

if(pattern_match(index0, skip)) {
i = ret0_index * 2;
ret0[i] = index0;
ret0[i + 1] = skip;
ret0_index++;

if(ret0_index >= ret0_count) {
ret0_count += RESULTSIZE_INCR;
ret0 = resize_result_buff(ret0, ret0_count);

if(ret0 == NULL) { // Not enough memory
PyErr_NoMemory();
return -1;
}
}
}
}
}
else {
skip = index0 - index1;

if((skip % step) == 0) {
skip /= step;

if(pattern_match_reversed(index0, skip)) {
i = ret1_index * 2;
ret1[i] = index1;
ret1[i + 1] = skip;
ret1_index++;

if(ret1_index >= ret1_count) {
ret1_count += RESULTSIZE_INCR;
ret1 = resize_result_buff(ret1, ret1_count);

if(ret1 == NULL) { // Not enough memory
PyErr_NoMemory();
return -1;
}
}
}
}
}

return 0;
}


// find possible combination, and pass to checking function
static int find_combination(void)
{
size_t i0, i1, step = pattern_len - 1;

ret0_count = RESULTSIZE_INIT;
ret0_index = 0;
ret0 = (size_t*)malloc(get_result_size(ret0_count));

ret1_count = RESULTSIZE_INIT;
ret1_index = 0;
ret1 = (size_t*)malloc(get_result_size(ret1_count));

// Main process
for(i0=0; i0<idx0_count; i0++) {
for(i1=0; i1<idx1_count; i1++) {
if(idx0[i0] == idx1[i1]) {
continue;
}

if(check_combination(idx0[i0], idx1[i1], step) < 0) {
return -1;
}
}
}

ret0_count = ret0_index;
ret1_count = ret1_index;

return 0;
}


// interface function
static int find_pattern(void)
{
idx0_count = INDEXSIZE_INIT;
idx0_index = 0;
idx0 = (size_t*)malloc(sizeof(size_t) * idx0_count);

idx1_count = INDEXSIZE_INIT;
idx1_index = 0;
idx1 = (size_t*)malloc(sizeof(size_t) * idx1_count);

// Get indices of first and last characters
if(find_indices(pattern[0], pattern[pattern_len - 1]) < 0) {
return -1;
}

//pattern_step = pattern_len - 1;

if(find_combination() < 0) {
return -1;
}

free(idx0);
free(idx1);

return 0;
}


// Create output objects
static PyObject* create_bytearray(size_t *buff, size_t count)
{
return PyByteArray_FromStringAndSize((char*)buff, get_result_size(count));
}

static PyObject* create_result_tuple(void)
{
PyObject *result0, *result1;

result0 = create_bytearray(ret0, ret0_count);
result1 = create_bytearray(ret1, ret1_count);

free(ret0);
free(ret1);

return Py_BuildValue("(OO)", result0, result1);
}


// Python interface
static PyObject* ayatori_find(PyObject *self, PyObject *args)
{
int err = PyArg_ParseTuple(args, "ss", &src, &pattern);
if(!err) {
return NULL;
}

src_len = strlen(src);
pattern_len = strlen(pattern);

if(find_pattern() < 0) {
return NULL;
}

return create_result_tuple();
}

static PyObject* ayatori_size_t(PyObject *self, PyObject *args)
{
return Py_BuildValue("I", sizeof(size_t));
}

static PyMethodDef ayatori_functions[] = {
{"find", ayatori_find, METH_VARARGS,
""},
{"size_t", ayatori_size_t, METH_NOARGS,
""},
{0}
};

#ifndef PyMODINIT_FUNC
#define PyMODINIT_FUNC void
#endif
PyMODINIT_FUNC initayatori(void)
{
PyObject *m = Py_InitModule3(
"ayatori", ayatori_functions,
"Provide find pattern function based Ayatori like search."
);
}

general.py
"""Copyright (c) 2012 Akihiko Ishida, http://codeit.blog.fc2.com/

Permission is hereby granted, free of charge, to any person obtaining
a copy of this software and associated documentation files (the
"Software"), to deal in the Software without restriction, including
without limitation the rights to use, copy, modify, merge, publish,
distribute, sublicense, and/or sell copies of the Software, and to
permit persons to whom the Software is furnished to do so, subject to
the following conditions:

The above copyright notice and this permission notice shall be
included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
"""

import numpy as np
import nosetail
import ayatori


def find_nosetail(src, pattern):
size_t = nosetail.size_t()

# Determine data type for sorting. Platform dependent.
if size_t == 4:
datatype = [('index', np.uint32), ('skip', np.uint32)]
elif size_t == 8:
datatype = [('index', np.uint64), ('skip', np.uint64)]
else:
raise OSError, 'Unknown pointer size'

# Find from nose to tail
r0 = nosetail.find(src, pattern)
if len(r0) == 0:
r0 = np.empty(0, dtype=datatype)
else:
r0 = np.frombuffer(r0, dtype=datatype)
r0.sort(order=['skip', 'index'])

# Check palindrome
pattern_r = pattern[::-1]

if pattern == pattern_r:
return r0, r0

# Find from tail to nose
r1 = nosetail.find(src, pattern_r)
if len(r1) == 0:
r1 = np.empty(0, dtype=datatype)
else:
r1 = np.frombuffer(nosetail.find(src, pattern_r), dtype=datatype)
r1.sort(order=['skip', 'index'])

return r0, r1


def find_ayatori(src, pattern):
size_t = ayatori.size_t()

# Determine data type for sorting. Platform dependent.
if size_t == 4:
datatype = [('index', np.uint32), ('skip', np.uint32)]
elif size_t == 8:
datatype = [('index', np.uint64), ('skip', np.uint64)]
else:
raise OSError, 'Unknown pointer size'

r0, r1 = ayatori.find(src, pattern)

if len(r0) == 0:
r0 = np.empty(0, dtype=datatype)
else:
r0 = np.frombuffer(r0, dtype=datatype)
r0.sort(order=['skip', 'index'])

if len(r1) == 0:
r1 = np.empty(0, dtype=datatype)
else:
r1 = np.frombuffer(r1, dtype=datatype)
r1.sort(order=['skip', 'index'])

return r0, r1

cipher_main.py
"""Copyright (c) 2012 Akihiko Ishida, http://codeit.blog.fc2.com/

Permission is hereby granted, free of charge, to any person obtaining
a copy of this software and associated documentation files (the
"Software"), to deal in the Software without restriction, including
without limitation the rights to use, copy, modify, merge, publish,
distribute, sublicense, and/or sell copies of the Software, and to
permit persons to whom the Software is furnished to do so, subject to
the following conditions:

The above copyright notice and this permission notice shall be
included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
"""

from datetime import datetime
import numpy
import goforit.cipher as ciph

def random_data(r, maxiter):
s = numpy.empty(maxiter).astype(numpy.uint8)
for i in xrange(maxiter):
r = (r * 1103515245 + 12345) & 0xFFFFFFFF
s[i] = 0x61 + (26 * (r / 0x10000)) / 0x10000
return s

def random_string(r, maxiter):
return random_data(r, maxiter).tostring()

def save_txt(filename, data):
f = open(filename, 'w')
f.write('%u\n' % len(data))
for d in data:
f.write('%u\t%u\n' % (d[0], d[1]))
f.close()

def main():
pattern = raw_input('Enter pattern: ')
pattern_r = pattern[::-1]

src = random_string(16, 300000)

t0 = datetime.now()
r0, r1 = ciph.find_nosetail(src, pattern)
t1 = datetime.now()
r2, r3 = ciph.find_ayatori(src, pattern)
t2 = datetime.now()

save_txt('nosetail-%s.txt' % pattern, r0)
save_txt('nosetail-%s.txt' % pattern_r, r1)
save_txt('ayatori-%s.txt' % pattern, r2)
save_txt('ayatori-%s.txt' % pattern_r, r3)

print t1 - t0, 'nose to tail method'
print t2 - t1, 'ayatori method'

if __name__ == '__main__':
main()

他に __init__.py がありますが,ほぼ内容が無いので省略します.
スポンサーサイト

テーマ : プログラミング
ジャンル : コンピュータ

コメントの投稿

非公開コメント

プロフィール

Ishida Akihiko

Author:Ishida Akihiko
FC2ブログへようこそ!

免責事項
当サイトに掲載する記事内容は,必ずしも正確性,信頼性,妥当性,有用性,完成度などを保証しません.記事の利用はすべて自己責任でお願いします.当サイトに掲載された内容によって発生したいかなる損害に対しても,管理人は一切の責任を負いかねます.
最新記事
最新コメント
最新トラックバック
月別アーカイブ
カテゴリ
アクセスカウンター
検索フォーム
RSSリンクの表示
リンク
ブロとも申請フォーム

この人とブロともになる

QRコード
QR
上記広告は1ヶ月以上更新のないブログに表示されています。新しい記事を書くことで広告を消せます。