
BERT solves Eiken problems

Eiken (実用英語技能検定) is an English proficiency test conducted by a Japanese public-interest incorporated foundation. One type of question in the test is a multiple-choice problem that asks the test-taker to fill a blank in a sentence. For example:

My sister usually plays tennis (   ) Saturdays.

  1. by  2. on  3. with  4. at

Bob (   ) five friends to his party.

  1. made  2. visited  3. invited  4. spoke

In this notebook we solve this type of question using pre-trained BERT models.

First, we use the masked language model, which is designed to guess the word hidden behind a mask token in a sentence. A drawback of this approach is that the model cannot guess a word that is not included in its vocabulary.

To handle unknown words, the second approach calculates perplexity scores of the sentences obtained by filling the blank with each choice. Since a lower perplexity score indicates that a sentence is more "natural," we pick the choice whose sentence has the lowest score as the answer.
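Concretely, for a sentence of tokens $w_1, \dots, w_N$, an autoregressive language model assigns each token a conditional probability, and perplexity is the exponentiated average negative log-likelihood:

$$\mathrm{PPL}(w_1, \dots, w_N) = \exp\!\left(-\frac{1}{N}\sum_{i=1}^{N}\log p(w_i \mid w_1, \dots, w_{i-1})\right)$$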


Collecting openpyxl
  Downloading openpyxl-3.0.7-py2.py3-none-any.whl (243 kB)
Collecting et-xmlfile
  Downloading et_xmlfile-1.1.0-py3-none-any.whl (4.7 kB)
Installing collected packages: et-xmlfile, openpyxl
Successfully installed et-xmlfile-1.1.0 openpyxl-3.0.7



(Output: Hugging Face Hub download progress bars for the model configuration, 331 MB of model weights, and the tokenizer files.)
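The notebook's code cells did not survive extraction, but judging from the six-layer RoBERTa summary printed further below, the output that follows was presumably produced by the standard fill-mask example, along these lines (the model name is an assumption; `distilroberta-base` is the pipeline default and matches the printed architecture):

```python
from transformers import pipeline

# Create a fill-mask pipeline; downloads the model and tokenizer on first use.
fill_mask = pipeline("fill-mask", model="distilroberta-base")

# Standard demo input: the pipeline returns the top-k candidates for <mask>.
fill_mask("HuggingFace is creating a <mask> that the community uses to solve NLP tasks.")
```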

[{'sequence': 'HuggingFace is creating a tool that the community uses to solve NLP tasks.',
  'score': 0.17927570641040802,
  'token': 3944,
  'token_str': ' tool'},
 {'sequence': 'HuggingFace is creating a framework that the community uses to solve NLP tasks.',
  'score': 0.11349428445100784,
  'token': 7208,
  'token_str': ' framework'},
 {'sequence': 'HuggingFace is creating a library that the community uses to solve NLP tasks.',
  'score': 0.05243517830967903,
  'token': 5560,
  'token_str': ' library'},
 {'sequence': 'HuggingFace is creating a database that the community uses to solve NLP tasks.',
  'score': 0.034935519099235535,
  'token': 8503,
  'token_str': ' database'},
 {'sequence': 'HuggingFace is creating a prototype that the community uses to solve NLP tasks.',
  'score': 0.028602516278624535,
  'token': 17715,
  'token_str': ' prototype'}]

Help on method call in module transformers.pipelines.fill_mask:

__call__(*args, targets=None, top_k: Union[int, NoneType] = None, **kwargs) method of transformers.pipelines.fill_mask.FillMaskPipeline instance
    Fill the masked token in the text(s) given as inputs.
    
    Args:
        args (:obj:`str` or :obj:`List[str]`):
            One or several texts (or one list of prompts) with masked tokens.
        targets (:obj:`str` or :obj:`List[str]`, `optional`):
            When passed, the model will return the scores for the passed token or tokens rather than the top k
            predictions in the entire vocabulary. If the provided targets are not in the model vocab, they will be
            tokenized and the first resulting token will be used (with a warning).
        top_k (:obj:`int`, `optional`):
            When passed, overrides the number of predictions to return.
    
    Return:
        A list or a list of lists of :obj:`dict`: Each result comes as a list of dictionaries with the following keys:
    
        - **sequence** (:obj:`str`) -- The corresponding input with the mask token prediction.
        - **score** (:obj:`float`) -- The corresponding probability.
        - **token** (:obj:`int`) -- The predicted token id (to replace the masked one).
        - **token_str** (:obj:`str`) -- The predicted token (to replace the masked one).
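The `targets` argument is the key to our use case: instead of ranking the whole vocabulary, we ask the pipeline to score only the four choices. A minimal sketch, reusing the `fill_mask` pipeline from above (note the leading space that RoBERTa's byte-level vocabulary expects):

```python
text = "My sister usually plays tennis <mask> Saturdays."
choices = ["by", "on", "with", "at"]

# Score only the four candidate words for the masked position.
results = fill_mask(text, targets=[" " + c for c in choices])
best = max(results, key=lambda r: r["score"])
print(best["token_str"].strip())  # expected: on
```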

RobertaForMaskedLM(
  (roberta): RobertaModel(
    (embeddings): RobertaEmbeddings(
      (word_embeddings): Embedding(50265, 768, padding_idx=1)
      (position_embeddings): Embedding(514, 768, padding_idx=1)
      (token_type_embeddings): Embedding(1, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): RobertaEncoder(
      (layer): ModuleList(
        (0-5): 6 x RobertaLayer(
          (attention): RobertaAttention(
            (self): RobertaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): RobertaSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
          )
          (intermediate): RobertaIntermediate(
            (dense): Linear(in_features=768, out_features=3072, bias=True)
          )
          (output): RobertaOutput(
            (dense): Linear(in_features=3072, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
      )
    )
  )
  (lm_head): RobertaLMHead(
    (dense): Linear(in_features=768, out_features=768, bias=True)
    (layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
    (decoder): Linear(in_features=768, out_features=50265, bias=True)
  )
)

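Next, the problems themselves. openpyxl was installed at the top, which suggests they are read from an Excel sheet; the loading code did not survive, but the records printed below are consistent with a simple `NamedTuple` container (a sketch, with field comments that are mine):

```python
from typing import List, NamedTuple

class Problem(NamedTuple):
    text: str           # sentence template; "{}" marks the blank
    choices: List[str]  # the four candidate words
    answer: str         # the correct choice
```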

[Problem(text='A: What is your {}? B: Kazumi Suzuki.', choices=['hour', 'club', 'date', 'name'], answer='name'),
 Problem(text='I know Judy. She can {} French very well.', choices=['see', 'drink', 'speak', 'open'], answer='speak'),
 Problem(text="A: Are your baseball shoes in your room, Mike? B: No, Mom. They're in my {} at school.", choices=['window', 'shop', 'locker', 'door'], answer='locker'),
 Problem(text='My sister usually plays tennis {} Saturdays.', choices=['by', 'on', 'with', 'at'], answer='on'),
 Problem(text='My mother likes {}. She has many pretty ones in the garden.', choices=['sports', 'movies', 'schools', 'flowers'], answer='flowers'),
 Problem(text="Let's begin today's class. Open your textbooks to {} 22.", choices=['chalk', 'ground', 'page', 'minute'], answer='page'),
 Problem(text='Today is Wednesday. Tomorrow is {}.', choices=['Monday', 'Tuesday', 'Thursday', 'Friday'], answer='Thursday'),
 Problem(text='I usually read magazines {} home.', choices=['of', 'on', 'with', 'at'], answer='at'),
 Problem(text="A: It's ten o'clock, Jimmy. {} to bed. B: All right, Mom.", choices=['Go', 'Sleep', 'Do', 'Sit'], answer='Go'),
 Problem(text="A: Do you live {} Tokyo? B: Yes. It's a big city.", choices=['after', 'with', 'on', 'in'], answer='in')]
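With the problems in hand, the masked-LM approach reduces to formatting the mask token into each template and keeping the highest-scoring choice. A sketch (the helper name and accuracy report are mine, not the notebook's); running it over the full problem set produces the vocabulary warnings interleaved below:

```python
def solve_with_fill_mask(problem: Problem) -> str:
    # Insert the model's mask token into the blank.
    text = problem.text.format(fill_mask.tokenizer.mask_token)
    # Score only the four choices (leading space for RoBERTa's vocabulary).
    results = fill_mask(text, targets=[" " + c for c in problem.choices])
    return max(results, key=lambda r: r["score"])["token_str"].strip()

correct = sum(solve_with_fill_mask(p) == p.answer for p in problems)
print(f"accuracy: {correct}/{len(problems)}")
```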


The specified target token ` coldly` does not exist in the model vocabulary. Replacing with `Ġcold`.
The specified target token ` busily` does not exist in the model vocabulary. Replacing with `Ġbus`.
The specified target token ` oversleep` does not exist in the model vocabulary. Replacing with `Ġovers`.
The specified target token ` apron` does not exist in the model vocabulary. Replacing with `Ġa`.
The specified target token ` needlessly` does not exist in the model vocabulary. Replacing with `Ġneed`.
The specified target token ` barricade` does not exist in the model vocabulary. Replacing with `Ġbarric`.
The specified target token ` phobia` does not exist in the model vocabulary. Replacing with `Ġph`.
The specified target token ` officiated` does not exist in the model vocabulary. Replacing with `Ġoffic`.
The specified target token ` synthesized` does not exist in the model vocabulary. Replacing with `Ġsynthes`.
The specified target token ` disarmed` does not exist in the model vocabulary. Replacing with `Ġdis`.
The specified target token ` vigor` does not exist in the model vocabulary. Replacing with `Ġvig`.
The specified target token ` spotless` does not exist in the model vocabulary. Replacing with `Ġspot`.
The specified target token ` dispositions` does not exist in the model vocabulary. Replacing with `Ġdispos`.
The specified target token ` enactments` does not exist in the model vocabulary. Replacing with `Ġenact`.
The specified target token ` speculations` does not exist in the model vocabulary. Replacing with `Ġspec`.
The specified target token ` garish` does not exist in the model vocabulary. Replacing with `Ġgar`.
The specified target token ` jovial` does not exist in the model vocabulary. Replacing with `Ġj`.
The specified target token ` pompous` does not exist in the model vocabulary. Replacing with `Ġpomp`.
The specified target token ` diffident` does not exist in the model vocabulary. Replacing with `Ġdiff`.
The specified target token ` dirge` does not exist in the model vocabulary. Replacing with `Ġdir`.
The specified target token ` prelude` does not exist in the model vocabulary. Replacing with `Ġpre`.
The specified target token ` commune` does not exist in the model vocabulary. Replacing with `Ġcommun`.
The specified target token ` alleviating` does not exist in the model vocabulary. Replacing with `Ġallev`.
The specified target token ` plagiarizing` does not exist in the model vocabulary. Replacing with `Ġplagiar`.
The specified target token ` inoculating` does not exist in the model vocabulary. Replacing with `Ġinoc`.
The specified target token ` beleaguering` does not exist in the model vocabulary. Replacing with `Ġbe`.
The specified target token ` elucidation` does not exist in the model vocabulary. Replacing with `Ġeluc`.
The specified target token ` affront` does not exist in the model vocabulary. Replacing with `Ġaff`.
The specified target token ` impasse` does not exist in the model vocabulary. Replacing with `Ġimp`.
The specified target token ` ultimatum` does not exist in the model vocabulary. Replacing with `Ġult`.
The specified target token ` pillage` does not exist in the model vocabulary. Replacing with `Ġpill`.
The specified target token ` exalt` does not exist in the model vocabulary. Replacing with `Ġex`.
The specified target token ` acclimate` does not exist in the model vocabulary. Replacing with `Ġacc`.
The specified target token ` congenial` does not exist in the model vocabulary. Replacing with `Ġcongen`.
The specified target token ` delirious` does not exist in the model vocabulary. Replacing with `Ġdel`.
The specified target token ` measly` does not exist in the model vocabulary. Replacing with `Ġmeas`.
The specified target token ` implausible` does not exist in the model vocabulary. Replacing with `Ġimpl`.
The specified target token ` jeopardized` does not exist in the model vocabulary. Replacing with `Ġjeopard`.
The specified target token ` stowed` does not exist in the model vocabulary. Replacing with `Ġst`.
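The warnings above illustrate the drawback mentioned at the start: higher-grade choices such as `vigor`, `impasse`, or `plagiarizing` are not single tokens in the model's vocabulary, so the pipeline silently falls back to the first subword (e.g. `Ġimp` for ` impasse`), and the comparison between choices becomes meaningless. This is what motivates the second, perplexity-based approach.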


(Output: download progress bars for the second model's configuration, 3.25 GB of weights, and tokenizer files.)
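A minimal sketch of the perplexity approach. The surviving output does not name the second model (only the roughly 3.25 GB download is visible), so `gpt2` below is a stand-in assumption; any causal language model works the same way:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Stand-in model; the notebook's actual (much larger) choice is not preserved.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def perplexity(sentence: str) -> float:
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        # With labels=input_ids the model returns the mean token-level
        # cross-entropy; exponentiating it gives the perplexity.
        loss = model(ids, labels=ids).loss
    return torch.exp(loss).item()

def solve_with_perplexity(problem: Problem) -> str:
    # Fill the blank with each choice and keep the most "natural" sentence.
    scores = {c: perplexity(problem.text.format(c)) for c in problem.choices}
    return min(scores, key=scores.get)
```

Because each choice is inserted as plain text before tokenization, subword splitting is no longer an obstacle: `impasse` simply becomes several tokens whose joint probability is scored like any other word's.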


