2017年3月18日 星期六

【Hadoop】Hadoop 相關系列產品簡介

Apache Hadoop:用於處理大數據的分散式開源框架
Apache Hama:基於 BSP 計算技術的框架,可用於圖形、矩陣的大數據計算
Apache HBase:分散式 NoSQL 資料庫系統
Apache HCatalog:Hadoop 的資料表和儲存管理層,提供資料的關聯檢視
Apache Hive:HDFS 上面的分散式資料倉儲系統
Apache Mahout:機器學習開源框架
Apache Oozie:用於 Hadoop 的工作流程調度引擎
Apache Sqoop:結構化數據與 Hadoop 之間的資料轉換工具
Apache Pig:Hadoop 的大數據分析平台
Apache Whirr:運行於雲端服務的類別庫
Apache Zookeeper:提供開源的分散式配置服務、同步服務、協調服務
Apache Ambari:提供 Web UI 和 REST API,可簡化 Hadoop 叢集的管理和監控
Apache Avro:二進位的資料序列化系統
Apache Bigtop:Hadoop 生態系統的開發、打包和測試系統
Apache Cassandra:分散式 NoSQL 資料庫系統
Apache Chukwa:日誌收集、處理、分析系統
Apache Crunch:提供 Java API,能夠簡化撰寫、測試和執行 MapReduce Pipeline 程序
Apache Flume:分散式日誌收集系統
Apache Giraph:可擴展的分散式圖形處理系統

【Spark】Apache Spark 2.0 筆記

Spark 2.0 的特性

更簡單(Easier: ANSI SQL and Streamlined APIs)

      
  • Unifying DataFrames and Datasets in Scala/Java
  •   
  • SparkSession
  •   
  • Simpler, more performant Accumulator API
  •   
  • DataFrame-based Machine Learning API emerges as the primary ML API
  •   
  • Machine learning pipeline persistence
  •   
  • Distributed algorithms in R
  •   
  • User-defined functions (UDFs) in R

更快速(Faster: Apache Spark as a Compiler)

      
  • 搭載了第二代 Tungsten 引擎,此技術官方稱為「whole-stage code generation」

更聰明(Smarter: Structured Streaming)

      
  • Integrated API with batch jobs
  •   
  • Transactional interaction with storage systems
  •   
  • Rich integration with the rest of Spark

Spark 2.0 官方介紹影片:SPARK EAST SUMMIT in New York(2016/02/16 )

版本釋出記錄:

Spark 2.2.0(2017/07/11)
Spark 2.1.1(2017/05/02)
Spark 2.1.0(2016/12/28)
Spark 2.0.2(2016/11/14)
Spark 2.0.1(2016/10/03)
Spark 2.0.0(2016/07/26)

【Python】Codility in Python : Lesson 12 - Euclidean Algorithm【ChocolatesByNumbers】

Euclidean Algorithm 第一題:【ChocolatesByNumbers】
There are N chocolates in a circle. Count the number of chocolates you will eat.

Two positive integers N and M are given.
Integer N represents the number of chocolates arranged in a circle, numbered from 0 to N − 1.

You start to eat the chocolates. After eating a chocolate you leave only a wrapper.

You begin with eating chocolate number 0.
Then you omit the next M − 1 chocolates or wrappers on the circle, and eat the following one.

More precisely, if you ate chocolate number X,
then you will next eat the chocolate with number (X + M) modulo N (remainder of division).

You stop eating when you encounter an empty wrapper.

For example, given integers N = 10 and M = 4.
You will eat the following chocolates: 0, 4, 8, 2, 6.

The goal is to count the number of chocolates that you will eat, following the above rules.

Write a function:

  def solution(N, M)

that, given two positive integers N and M, returns the number of chocolates that you will eat.

For example, given integers N = 10 and M = 4. the function should return 5, as explained above.

Assume that:
  N and M are integers within the range [1..1,000,000,000].


方法一:參考 Maurice Dassen 大大的做法
Correctness:100%、Performance:100%
```python def solution(N, M): x = 0 while x != 1: x, y = N, M while y != 0: x, y = y, x % y N, M = N / x, M / x return N ```

完整練習題 source code 請參閱:github

2017年3月3日 星期五

【Python】Codility in Python : Lesson 11 - Sieve of Eratosthenes【CountSemiprimes】

Sieve of Eratosthenes 第一題:【CountSemiprimes】
Count the semiprime numbers in the given range [a..b]

A prime is a positive integer X that has exactly two distinct divisors: 1 and X.
The first few prime integers are 2, 3, 5, 7, 11 and 13.

A semiprime is a natural number that is the product of two (not necessarily distinct) prime numbers.
The first few semiprimes are 4, 6, 9, 10, 14, 15, 21, 22, 25, 26.

You are given two non-empty zero-indexed arrays P and Q, each consisting of M integers.
These arrays represent queries about the number of semiprimes within specified ranges.

Query K requires you to find the number of semiprimes within the range (P[K], Q[K]),
where 1 ≤ P[K] ≤ Q[K] ≤ N.

For example, consider an integer N = 26 and arrays P, Q such that:

  P[0] = 1 Q[0] = 26
  P[1] = 4 Q[1] = 10
  P[2] = 16 Q[2] = 20

The number of semiprimes within each of these ranges is as follows:

(1, 26) is 10,
(4, 10) is 4,
(16, 20) is 0.

Write a function:

  def solution(N, P, Q)

that, given an integer N and two non-empty zero-indexed arrays P and Q consisting of M integers,
returns an array consisting of M elements specifying the consecutive answers to all the queries.

For example, given an integer N = 26 and arrays P, Q such that:

  P[0] = 1 Q[0] = 26
  P[1] = 4 Q[1] = 10
  P[2] = 16 Q[2] = 20

the function should return the values [10, 4, 0], as explained above.

Assume that:
  N is an integer within the range [1..50,000];
  M is an integer within the range [1..30,000];
  each element of arrays P, Q is an integer within the range [1..N];
  P[i] ≤ Q[i].


方法一:用半質數相乘取出結果
Correctness:100%、Performance:40%
```python from math import sqrt def solution(N, P, Q): S = [0] * (N+1) # 先取得半質數列表 prime = [ p for p in range(2, int(N / 2) + 1) if 0 not in [ p % d for d in range(2, int(sqrt(p))+1)] ]   # 標註所有相乘結果 for i in prime: for j in prime: k = i * j if k <= N: S[k] = 1 # 取得範圍內有多少註記 result = [] for i in range(0, len(P)): result.append(len(list(filter(lambda x: x == 1, S[P[i]:Q[i]+1])))) return result ```

完整練習題 source code 請參閱:github

【Python】Codility in Python : Lesson 10 - Prime and composite numbers【Peaks】

Prime and composite numbers 第三題:【Peaks】
Divide an array into the maximum number of same-sized blocks,
each of which should contain an index P such that A[P - 1] < A[P] > A[P + 1].

A non-empty zero-indexed array A consisting of N integers is given.

A peak is an array element which is larger than its neighbors.
More precisely, it is an index P such that 0 < P < N − 1, A[P − 1] < A[P] and A[P] > A[P + 1].

For example, the following array A:

  A[0] = 1
  A[1] = 2
  A[2] = 3
  A[3] = 4
  A[4] = 3
  A[5] = 4
  A[6] = 1
  A[7] = 2
  A[8] = 3
  A[9] = 4
  A[10] = 6
  A[11] = 2

has exactly three peaks: 3, 5, 10.

We want to divide this array into blocks containing the same number of elements.
More precisely, we want to choose a number K that will yield the following blocks:

A[0], A[1], ..., A[K − 1],
A[K], A[K + 1], ..., A[2K − 1],
...
A[N − K], A[N − K + 1], ..., A[N − 1].

What's more, every block should contain at least one peak.
Notice that extreme elements of the blocks (for example A[K − 1] or A[K]) can also be peaks,
but only if they have both neighbors (including one in an adjacent blocks).

The goal is to find the maximum number of blocks into which the array A can be divided.

Array A can be divided into blocks as follows:

one block (1, 2, 3, 4, 3, 4, 1, 2, 3, 4, 6, 2). This block contains three peaks.
two blocks (1, 2, 3, 4, 3, 4) and (1, 2, 3, 4, 6, 2). Every block has a peak.
three blocks (1, 2, 3, 4), (3, 4, 1, 2), (3, 4, 6, 2). Every block has a peak.
Notice in particular that the first block (1, 2, 3, 4) has a peak at A[3],
because A[2] < A[3] > A[4], even though A[4] is in the adjacent block.

However, array A cannot be divided into four blocks,
(1, 2, 3), (4, 3, 4), (1, 2, 3) and (4, 6, 2),
because the (1, 2, 3) blocks do not contain a peak.
Notice in particular that the (4, 3, 4) block contains two peaks: A[3] and A[5].

The maximum number of blocks that array A can be divided into is three.

Write a function:

  def solution(A)

that, given a non-empty zero-indexed array A consisting of N integers,
returns the maximum number of blocks into which A can be divided.

If A cannot be divided into some number of blocks, the function should return 0.

For example, given:

  A[0] = 1
  A[1] = 2
  A[2] = 3
  A[3] = 4
  A[4] = 3
  A[5] = 4
  A[6] = 1
  A[7] = 2
  A[8] = 3
  A[9] = 4
  A[10] = 6
  A[11] = 2

the function should return 3, as explained above.

Assume that:
  N is an integer within the range [1..100,000];
  each element of array A is an integer within the range [0..1,000,000,000].


方法一:參考 martinkysel.com 提供的做法
Correctness:100%、Performance:100%
```python def solution(N): peaks = [] length = len(A) for i in range(1, length-1): if A[i-1] < A[i] > A[i+1]: peaks.append(i) count = len(peaks) if count == 0: return 0 for i in range(count, 0, -1): if len(A) % i == 0: blockSize = length // i found = [0] * i foundCount = 0 for peak in peaks: blockIndex = peak // blockSize if found[blockIndex] == 0: found[blockIndex] = 1 foundCount += 1 if foundCount == i: return i return 0 ```

完整練習題 source code 請參閱:github

2017年3月1日 星期三

【Python】Codility in Python : Lesson 10 - Prime and composite numbers【MinPerimeterRectangle】

Prime and composite numbers 第二題:【MinPerimeterRectangle】
Find the minimal perimeter of any rectangle whose area equals N.

An integer N is given, representing the area of some rectangle.

The area of a rectangle whose sides are of length A and B is A * B, and the perimeter is 2 * (A + B).

The goal is to find the minimal perimeter of any rectangle whose area equals N.
The sides of this rectangle should be only integers.

For example, given integer N = 30, rectangles of area 30 are:

(1, 30), with a perimeter of 62,
(2, 15), with a perimeter of 34,
(3, 10), with a perimeter of 26,
(5, 6), with a perimeter of 22.

Write a function:

  def solution(N)

that, given an integer N, returns the minimal perimeter of any rectangle whose area is exactly equal to N.

For example, given an integer N = 30, the function should return 22, as explained above.

Assume that:
  N is an integer within the range [1..1,000,000,000].


方法一:用平方根的方式推算
Correctness:100%、Performance:100%
```python def solutionBySqrt(N): squareRoot = math.sqrt(N) for index in range(math.ceil(squareRoot), 0, -1): if N % index == 0: return int(2 * (index + (N / index))) return -1 ```

完整練習題 source code 請參閱:github