# Chapter 6 Solutions

## Chapter 6: Machine Learning

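50. Obtaining and formatting the data
51. Feature extraction
52. Training
53. Prediction
54. Measuring accuracy
55. Creating a confusion matrix
56. Measuring precision, recall, and F1 score
57. Checking feature weights
58. Changing the regularization parameter
59. Searching for hyperparameters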

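### 50. Obtaining and formatting the data

Download the News Aggregator Data Set and create the training data (train.txt), validation data (valid.txt), and evaluation data (test.txt) according to the task's instructions.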

--2021-08-31 04:13:41-- https://archive.ics.uci.edu/ml/machine-learning-databases/00359/NewsAggregatorDataset.zip
Resolving archive.ics.uci.edu (archive.ics.uci.edu)... 128.195.10.252
Connecting to archive.ics.uci.edu (archive.ics.uci.edu)|128.195.10.252|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 29224203 (28M) [application/x-httpd-php]
Saving to: ‘NewsAggregatorDataset.zip.1’

NewsAggregatorDatas 100%[===================>]  27.87M  26.4MB/s    in 1.1s

2021-08-31 04:13:42 (26.4 MB/s) - ‘NewsAggregatorDataset.zip.1’ saved [29224203/29224203]

Archive:  NewsAggregatorDataset.zip
inflating: 2pageSessions.csv
inflating: __MACOSX/._2pageSessions.csv
inflating: newsCorpora.csv
inflating: __MACOSX/._newsCorpora.csv

Attribute Information:

FILENAME #1: newsCorpora.csv (102.297.000 bytes)
DESCRIPTION: News pages
FORMAT: ID TITLE URL PUBLISHER CATEGORY STORY HOSTNAME TIMESTAMP

where:
ID Numeric ID
TITLE News title
URL Url
PUBLISHER Publisher name
CATEGORY News category (b = business, t = science and technology, e = entertainment, m = health)
STORY Alphanumeric ID of the cluster that includes news about the same story
HOSTNAME Url hostname
TIMESTAMP Approximate time the news was published, as the number of milliseconds since the epoch 00:00:00 GMT, January 1, 1970

FILENAME #2: 2pageSessions.csv (3.049.986 bytes)
DESCRIPTION: 2-page sessions
FORMAT: STORY HOSTNAME CATEGORY URL

where:
STORY Alphanumeric ID of the cluster that includes news about the same story
HOSTNAME Url hostname
CATEGORY News category (b = business, t = science and technology, e = entertainment, m = health)
URL Two space-delimited urls representing a browsing session


Train:10672 Valid:1334 Test:1334
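The sizes printed above are consistent with an 80/10/10 split of 13,340 rows. The following is a minimal sketch of such a split, assuming the rows have already been read from newsCorpora.csv and filtered as the task requires. The function name `split_rows`, the seed, and the placeholder rows are illustrative and not taken from the original notebook.

```python
import random


def split_rows(rows, seed=0):
    """Shuffle rows and split them 80/10/10 into train, valid, and test."""
    shuffled = list(rows)                   # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)   # seeded shuffle for reproducibility
    n_train = len(shuffled) * 8 // 10       # integer arithmetic avoids float rounding
    n_valid = (len(shuffled) - n_train) // 2
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]
    return train, valid, test


# Hypothetical stand-in for the filtered (CATEGORY, TITLE) rows.
rows = [("b", f"headline {i}") for i in range(13340)]
train, valid, test = split_rows(rows)
print(f"Train:{len(train)} Valid:{len(valid)} Test:{len(test)}")
```

Each split can then be written to train.txt, valid.txt, and test.txt, for example as tab-separated category and title columns.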

### 51. Feature extraction

['00', '05', '07', '08', '09', '0ff', '0ut', '10', '100', '1000', '10000', '100000', '100k', '100th', '101', '103', '104', '106', '107', '108', '10k', '10m', '10million', '10th', '11', '110', '1100', '111', '113', '114', '115', '1150', '116', '117', '118', '11m', '12', '120', '1201178058', '121', '1270', '129', '13', '1300', '131', '1399983366398', '1399983366584', '1399983366926', '1399983367118', '1399983367406', '1399983367691', '1399985294553', '1399985294870', '1399985295162', '1399985295432', '13th', '14', '142136', '148', '148948', '149002', '14lb', '14m', '14th', '15', '150', '1500', '1550', '156000', '15lbs', '15m', '15th', '16', '16000', '16k', '16m', '17', '17000', '172', '175', '17500', '178', '179', '18', '18000', '186f', '19', '1900', '19000', '1914', '1918', '1950s', '1956', '1964', '1978', '1979', '1980s', '1981', '1983', '1987', '1990', '1990s', '1996', '1997', '1999', '19m', '19th', '1bn', '1d', '1m', '1q', '1st', '20', '200', '2000', '20000', '200000', '2001', '2003', '2005', '2006', '2007', '2008', '2009', '2010', '2011', '2012', '2013', '2014', '2015', '2016', '2017', '2018', '2019', '2020', '2021', '2023', '2030', '2049', '2064', '20headlines', '20th', '21', '2100', '210714', '210715', '210716', '210717', '210718', '210719', '210720', '210721', '210722', '210723', '210m', '212', '215', '218000', '21st', '22', '225', '22lbs', '23', '230', '230000', '23435', '24', '24000', '240000', '249', '24million', '25', '250', '25000', '252867', '254', '255', '26', '26k', '26th', '27', '28', '280', '281000', '283', '284000', '28728', '288k', '28th', '29', '290', '2900', '291', '295', '29k', '2billion', '2bntoyota', '2dayfm', '2million', '2nd', '2p', '2â', '30', '300', '3000', '30000', '300ft', '300k', '30th', '31', '318m', '32', '320', '325', '33', '337', '34', '35', '354', '35s', '36', '360', '37', '370', '3700', '37000', '377', '38', '3800', '3billion', '3d', '3m', '3q', '3rd', '40', '400', '4000', '400k', '400m', '400million', '401', '4012', '40m', 
... (output truncated)

size of tfidf vector 13460
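The token list and vector size above look like the vocabulary of a TF-IDF vectorizer. The sketch below assumes scikit-learn's `TfidfVectorizer` with default settings, which may differ from what the original notebook used. The headlines are hypothetical placeholders for the TITLE column.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical headlines standing in for the TITLE column of each split.
train_titles = [
    "Fed signals rate rise as stocks rally",
    "Kim Kardashian shares wedding video",
    "Ebola outbreak: 10 new cases reported",
]
valid_titles = ["Stocks fall after Fed update"]

# Defaults: lowercase the text and keep tokens of two or more word characters.
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_titles)  # learn vocabulary and IDF on train only
X_valid = vectorizer.transform(valid_titles)      # reuse the vocabulary learned on train

vocabulary = sorted(vectorizer.vocabulary_)       # token strings in column order
print(vocabulary[:5])
print("size of tfidf vector", len(vocabulary))
```

Fitting on the training split only and calling `transform` on the other splits keeps validation and evaluation text out of the learned vocabulary and IDF weights.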

### 52. Training

Train a logistic regression model using the training data constructed in 51.

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 2.9min remaining: 0.0s
[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 2.9min finished

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
intercept_scaling=1, l1_ratio=None, max_iter=1000,
multi_class='auto', n_jobs=None, penalty='l2',
random_state=200, solver='lbfgs', tol=0.0001, verbose=100,
warm_start=False)
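A minimal sketch of the training step, assuming scikit-learn's `LogisticRegression` on TF-IDF features. Only `max_iter=1000` and `random_state=200` are taken from the printed model above. The toy headlines and variable names are hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data: (category, headline) pairs standing in for train.txt.
train = [
    ("b", "Fed holds rates as bank stocks rise"),
    ("b", "ECB update lifts euro and bank profit"),
    ("e", "Kim Kardashian stars in new movie"),
    ("e", "Miley Cyrus video tops the chart"),
    ("m", "Ebola outbreak: new cases reported"),
    ("m", "FDA approves new cancer drug after study"),
    ("t", "Google unveils new mobile software"),
    ("t", "NASA tests new Mars rover"),
]
labels = [category for category, _ in train]
titles = [title for _, title in train]

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(titles)

# max_iter and random_state follow the printed model; everything else is default.
model = LogisticRegression(max_iter=1000, random_state=200)
model.fit(X_train, labels)

print(model.classes_)     # category labels, in the row order of coef_
print(model.coef_.shape)  # (number of classes, number of features)
```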


### 53. Prediction

Using the logistic regression model trained in 52, implement a program that computes the category of a given news headline and its predicted probability.

[array([0.94987387, 0.87020651, 0.91319897, ..., 0.6472321 , 0.60348582, 0.87017636]), array(['b', 'e', 'b', ..., 'b', 't', 'e'], dtype=object)]
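The output above pairs each headline's highest class probability with its predicted label. One way to produce that shape of result is sketched below, assuming a fitted scikit-learn classifier. The helper name `predict_with_probability` and the toy data are hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression


def predict_with_probability(model, X):
    """Return [highest class probability per row, predicted label per row]."""
    proba = model.predict_proba(X)   # shape: (n_samples, n_classes)
    best = proba.argmax(axis=1)      # column index of the most likely class
    return [proba.max(axis=1), model.classes_[best]]


# Hypothetical toy data standing in for the real training set.
titles = [
    "Fed holds rates as bank stocks rise",
    "Kim Kardashian stars in new movie",
    "Ebola outbreak: new cases reported",
    "Google unveils new mobile software",
]
labels = ["b", "e", "m", "t"]

vectorizer = TfidfVectorizer()
model = LogisticRegression(max_iter=1000, random_state=200)
model.fit(vectorizer.fit_transform(titles), labels)

# A new headline must go through the same fitted vectorizer before prediction.
X_new = vectorizer.transform(["Bank stocks rise after Fed update"])
probabilities, categories = predict_with_probability(model, X_new)
print(categories[0], probabilities[0])
```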

### 54. Measuring accuracy

Measure the accuracy of the logistic regression model trained in 52 on the training data and the evaluation data.

Train accuracy:0.9452773613193404
Valid accuracy:0.8808095952023988
Test accuracy:0.9010494752623688
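Accuracy is the fraction of examples whose predicted label matches the gold label. The sketch below computes it in plain Python on hypothetical labels. `sklearn.metrics.accuracy_score` performs the same calculation.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the predicted label equals the gold label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Hypothetical gold and predicted categories.
y_true = ["b", "e", "t", "m", "b"]
y_pred = ["b", "e", "b", "m", "b"]
print(f"accuracy: {accuracy(y_true, y_pred)}")  # 4 of 5 correct -> 0.8
```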

### 55. Creating a confusion matrix

Create the confusion matrix of the logistic regression model trained in 52 on the training data and the evaluation data.
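No output survives for this task, so the following is only a sketch of the computation on hypothetical labels. Rows are gold categories and columns are predicted categories, the same orientation that `sklearn.metrics.confusion_matrix` uses.

```python
from collections import Counter


def confusion_matrix(y_true, y_pred, labels):
    """Return a matrix where entry [i][j] counts gold labels[i] predicted as labels[j]."""
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(t, p)] for p in labels] for t in labels]


# Hypothetical gold and predicted categories.
labels = ["b", "e", "m", "t"]
y_true = ["b", "b", "e", "e", "m", "t", "t"]
y_pred = ["b", "e", "e", "e", "m", "b", "t"]

matrix = confusion_matrix(y_true, y_pred, labels)
for label, row in zip(labels, matrix):
    print(label, row)
```

The diagonal holds the correctly classified examples, so off-diagonal entries show which categories are confused with each other.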

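### 56. Measuring precision, recall, and F1 score

Measure the precision, recall, and F1 score of the logistic regression model trained in 52 on the evaluation data. Compute precision, recall, and F1 score for each category, then combine the per-category results using the micro-average and the macro-average.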
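The result table for this task did not survive, so the sketch below only illustrates the definitions on hypothetical labels. The macro-average is the mean of the per-category scores, while the micro-average pools true positives, false positives, and false negatives over all categories before computing the scores. All function names are illustrative.

```python
def prf(tp, fp, fn):
    """Precision, recall, and F1 from true-positive, false-positive, false-negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def evaluate(y_true, y_pred, labels):
    """Return per-category scores plus micro- and macro-averaged scores."""
    pairs = list(zip(y_true, y_pred))
    counts = {}
    for c in labels:
        tp = sum(t == c and p == c for t, p in pairs)
        fp = sum(t != c and p == c for t, p in pairs)
        fn = sum(t == c and p != c for t, p in pairs)
        counts[c] = (tp, fp, fn)
    per_class = {c: prf(*counts[c]) for c in labels}
    # Macro: average the per-category precision, recall, and F1.
    macro = tuple(sum(s[i] for s in per_class.values()) / len(labels) for i in range(3))
    # Micro: pool the counts over all categories, then compute the scores once.
    micro = prf(*(sum(counts[c][i] for c in labels) for i in range(3)))
    return per_class, micro, macro


# Hypothetical gold and predicted categories.
labels = ["b", "e", "m", "t"]
y_true = ["b", "b", "e", "e", "m", "t", "t"]
y_pred = ["b", "e", "e", "e", "m", "b", "t"]

per_class, micro, macro = evaluate(y_true, y_pred, labels)
for c in labels:
    print(c, per_class[c])
print("micro", micro)
print("macro", macro)
```

When every example has exactly one gold label and one predicted label, micro-averaged precision, recall, and F1 all equal the accuracy.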

### 57. Checking feature weights

For the logistic regression model trained in 52, check the 10 features with the highest weights and the 10 features with the lowest weights.

business
high: ['fed' 'china' 'bank' 'stocks' 'ecb' 'update' 'euro' 'ukraine' 'yellen' 'profit']
low: ['the' 'and' 'ebola' 'her' 'she' 'apple' 'kardashian' 'video' 'study' 'google']

entertainment
high: ['kardashian' 'chris' 'star' 'she' 'kim' 'miley' 'cyrus' 'movie' 'paul'
'thrones']
'billion']

health
high: ['ebola' 'study' 'fda' 'drug' 'cancer' 'mers' 'cases' 'heart' 'could'
'outbreak']

science & technology
'mobile' 'nasa']
low: ['stocks' 'fed' 'ecb' 'shares' 'her' 'day' 'men' 'kardashian' 'drug'
'ukraine']
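Lists like the ones above can be obtained by sorting one category's weight vector and reading off both ends. The sketch below does this in plain Python. The helper name `top_features` and the weights are hypothetical and are not values from the trained model.

```python
def top_features(weights, feature_names, k=10):
    """Return (k highest-weighted names, k lowest-weighted names) for one category."""
    order = sorted(range(len(weights)), key=lambda i: weights[i])  # ascending by weight
    low = [feature_names[i] for i in order[:k]]
    high = [feature_names[i] for i in reversed(order[-k:])]        # largest weight first
    return high, low


# Hypothetical weights for a single category.
feature_names = ["bank", "fed", "her", "stocks", "the", "video"]
weights = [0.9, 1.5, -0.7, 1.1, -1.2, -0.4]

high, low = top_features(weights, feature_names, k=2)
print("high:", high)
print("low:", low)
```

With a fitted scikit-learn model, row `i` of `model.coef_` holds the weights for the category `model.classes_[i]`, and the feature names come from the fitted vectorizer.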


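### 58. Changing the regularization parameter

When training a logistic regression model, the degree of overfitting during training can be controlled by adjusting the regularization parameter. Train logistic regression models with different regularization parameters and compute the accuracy on the training data, validation data, and evaluation data. Summarize the results of the experiment in a graph with the regularization parameter on the horizontal axis and accuracy on the vertical axis.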
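A minimal sketch of such a sweep, assuming scikit-learn's `LogisticRegression`, where `C` is the inverse of the regularization strength (smaller `C` means stronger regularization). The synthetic dataset, the helper name `sweep_regularization`, and the grid of `C` values are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split


def sweep_regularization(splits, c_values):
    """Train one model per C on splits['train'] and record accuracy on every split."""
    X_train, y_train = splits["train"]
    results = {name: [] for name in splits}
    for c in c_values:
        model = LogisticRegression(C=c, max_iter=1000, random_state=200)
        model.fit(X_train, y_train)
        for name, (X, y) in splits.items():
            results[name].append(model.score(X, y))  # mean accuracy on this split
    return results


# Synthetic four-class data standing in for the TF-IDF features and categories.
X, y = make_classification(n_samples=600, n_features=20, n_informative=5,
                           n_classes=4, random_state=0)
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.2, random_state=0)
X_valid, X_test, y_valid, y_test = train_test_split(X_rest, y_rest, test_size=0.5,
                                                    random_state=0)

splits = {"train": (X_train, y_train), "valid": (X_valid, y_valid), "test": (X_test, y_test)}
c_values = [0.01, 0.1, 1.0, 10.0, 100.0]
results = sweep_regularization(splits, c_values)
for name, scores in results.items():
    print(name, [round(s, 3) for s in scores])
```

To produce the requested graph, each list in `results` can be plotted against `c_values` with matplotlib, using a logarithmic horizontal axis because the values span several orders of magnitude.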