
Alphabet Board Chairman John Hennessy: AI's demand for compute is soaring just as we enter the era of dark silicon | TMTPost T-EDGE


Fifty years ago, Intel co-founder Gordon Moore formulated Moore's law: the number of transistors that fit on an integrated circuit roughly doubles every 24 months. In the past two years, however, the debate over whether Moore's law has run its course has never stopped.

Alongside Moore's law came Dennard scaling: as transistor density increases, the power consumed by each transistor falls, so the power per square millimeter of silicon stays roughly constant. But Dennard scaling began to slow markedly in 2007 and had all but broken down by 2012.

In other words, iterating on semiconductor technology alone can no longer deliver leaps in performance, and even multicore designs have not significantly improved energy efficiency. Given that, can we find more efficient ways to use the circuits we have? And what trends lie ahead for the semiconductor industry?

Addressing these questions at the 2021 T-EDGE Global Innovation Conference, co-hosted by TMTPost and the National New Media Industry Base, John Hennessy — chairman of the board of Google's parent company Alphabet, winner of the 2017 Turing Award, and former president of Stanford University — delivered a speech titled "Trends and Challenges in Deep Learning and Semiconductor Technologies."

In his view, achieving greater performance improvements requires new architectural approaches that use integrated-circuit capabilities more effectively. He sees three possible directions:

1. Software-centric mechanisms: improving the efficiency of software so that it makes more effective use of the hardware;

2. Hardware-centric approaches, also known as domain-specific architectures or domain-specific accelerators;

3. A partial combination of the two: developing languages matched to these specific architectures so that people can build applications more effectively.

Against this backdrop, Hennessy argues: "General-purpose processors will no longer be the main force driving the industry. Domain-specific processors, moving quickly together with software, will gradually play a major role. So we may again see a more vertical industry: closer vertical integration between the people who build deep learning and machine learning models and the people who build the operating systems and compilers, so that their programs can run efficiently, train efficiently, and be deployed in the field."

Below is the transcript of John Hennessy's speech, edited and organized by TMTPost:

Hello, I'm John Hennessy, professor of computer science and electrical engineering at Stanford University, and co-winner of the Turing Award in 2017.

It's my pleasure to participate in the 2021 T-EDGE conference.

Today I'm going to talk about the trends and challenges in deep learning and semiconductor technologies — one a critical building block for computing, the other an incredible new breakthrough in how we use computers — how the two are interacting and conflicting, and how they might go forward.

AI has been around for roughly 60 years, and for many years it continued to make progress, but at a slow rate — much slower than many of the early prophets of AI had predicted.

Then there was a dramatic breakthrough in deep learning. There were several smaller examples, but certainly AlphaGo defeating the world's Go champion, at least ten years before it was expected, was a dramatic breakthrough. It relied on deep learning technologies, and it exhibited what even professional Go players described as creative play.

That was the beginning of a world change.

Today we've seen many other deep learning breakthroughs, with deep learning applied to complex problems: image recognition, obviously crucial because it enables self-driving cars; medical diagnosis, where it is becoming more and more useful — for example, looking at images of skin to tell whether a lesion is cancerous; and natural language, particularly machine translation.

For Latin-based languages, machine translation is now basically as good as professional translators, and it is improving constantly for Chinese-to-English — a much more challenging translation problem — where we are seeing significant progress as well.

Most recently we've seen AlphaFold 2, DeepMind's approach to using deep learning for protein folding. It advanced the field by at least a decade in terms of what is doable in applying this technology to biology, and it is going to dramatically change the way we discover new drugs in the future.

What drove this incredible breakthrough in deep learning? Clearly the technology concepts had been around for a while — and in many cases had in fact been discarded earlier.

So why were we able to make the breakthrough now?

First of all, we had massive amounts of data for training. The Internet is a treasure trove of data that can be used for training. ImageNet was a critical tool for training image recognition: today there are close to 100,000 object classes on ImageNet, with more than 1,000 images per class — enough to train image recognition systems really well. That was key.

Obviously we use lots of other data as well: whether it's protein folding, medical diagnosis, or natural language, we rely on data available on the Internet that has been accurately labeled for training.

Second, we were able to marshal massive computational resources, primarily through large data centers and cloud-based computing. Training takes hours and hours on thousands of specialized processors. We simply didn't have this capability earlier, so it was crucial to solving the training problem.

I want to emphasize that training is the computationally intensive problem here; inference is much simpler by comparison. Here you see the growth in performance demand, measured in petaflop/s-days, needed to train a series of models. Training AlphaZero, for example, required about 1,000 petaflop/s-days — roughly a week on the largest computers available in the world.

This demand has been growing faster than Moore's law — faster than semiconductors improved even in their very best era. We've seen a 300,000-fold increase in compute from training simple models like AlexNet up to AlphaGo Zero, and new models like GPT-3 have billions of parameters that need to be set. The amount of data the training has to look at is truly massive, and that's where the real challenge comes.
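(Editor's note: a petaflop/s-day is the work done by sustaining 10^15 floating-point operations per second for a full day. Unpacking the AlphaZero figure quoted above:

    1 petaflop/s-day = 10^15 FLOP/s × 86,400 s ≈ 8.64 × 10^19 operations
    1,000 petaflop/s-days ≈ 8.64 × 10^22 operations

At that scale, a machine sustaining about 140 petaflop/s — roughly the top supercomputers of the time — needs about a week, which matches the figure in the talk.)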

Moore's law — the version Gordon Moore gave in 1975 — predicted that semiconductor density would continue to grow quickly, basically doubling every two years, but we began to diverge from that. The divergence started around 2000, and the gap has grown wider since. As Gordon said on the 50th anniversary of his first prediction: no exponential is forever. Moore's law is not a theorem or something that must hold true; it is an ambition the industry was able to focus on and track. If you look at this curve, you'll notice that over roughly 50 years we fell behind by only a factor of about 15 while gaining a factor of almost 10,000.
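(Editor's note: as a compounding rule, the 1975 prediction says that density after t years is about 2^(t/2) times the starting density — a decade on schedule is 2^5 = 32×, two decades 2^10 ≈ 1,000×. Falling behind by "only" 15× over 50 years of such doubling is what Hennessy means by staying close to the curve.)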

So we've largely been able to stay on this curve, but we have begun diverging, and when you factor in the increasing cost of new fabs and new technologies, this curve, converted to price per transistor, is not dropping nearly as fast as it once did.

We also face another problem: the end of so-called Dennard scaling. Dennard scaling is an observation by Robert Dennard, inventor of the DRAM that is ubiquitous in computing technology. He observed that as dimensions shrank, so would the voltage and, for example, the capacitance, and the result would be nearly constant power per square millimeter of silicon. Because the number of transistors per square millimeter was going up dramatically from one generation to the next, the power per computation was actually dropping quite quickly. That really came to a halt around 2007: this red curve, which was rising slowly between 2000 and 2007, began to take off. Power became the key issue, and figuring out how to get energy efficiency would become more and more important as these technologies went forward.
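(Editor's note: the arithmetic behind Dennard's observation. Dynamic power per transistor is roughly

    P ≈ C · V² · f

where C is capacitance, V is supply voltage, and f is clock frequency. Scale all linear dimensions down by a factor k: C and V each shrink by k while f rises by k, so power per transistor falls by k² — exactly offsetting the k² more transistors that now fit in the same area, leaving power per square millimeter constant. Once supply voltage could no longer keep dropping, largely because of leakage current, that offset disappeared: the post-2007 breakdown described here.)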

The combined result is that we've seen a leveling off of uniprocessor — single-core — performance: after rapid growth of roughly 25% a year in the industry's early period, then a remarkable period of over 50% a year with the introduction of RISC technologies and instruction-level parallelism, came a slower period focused very much on multicore designs built on those technologies.

In the last two years there has been less than 5% improvement in performance per year. Even if you look at multicore designs, with the inefficiencies they bring, they don't significantly improve the picture.

And indeed we are in the era of dark silicon, where a multicore chip often slows down or shuts off a core to prevent overheating — and that overheating comes from power consumption.

So what are we going to do? We're in a dilemma. We have a new technology, deep learning, which seems able to solve problems we never thought we could handle effectively, but which requires massive amounts of computing power to go forward. At the same time, the end of Moore's law and Dennard scaling is squeezing the industry's ability to do what it relied on for many years: simply move to the next generation of semiconductor technology and have everything get faster.

So we have to think about a new solution. There are three possible directions to go.

Software-centric mechanisms: improving the efficiency of our software so that it makes more efficient use of the hardware. The move to dynamically typed scripting languages such as Python made programming very easy, but they're not terribly efficient, as you will see in just a second.

Hardware-centric approaches: can we change the way we architect these machines to make them much more efficient? This approach is called domain-specific architectures, or domain-specific accelerators. The idea is to do just a few tasks, but to tune the hardware to do them extremely well. We've already seen examples of this in graphics, or in the modem inside your cell phone. Those are special-purpose architectures that use intensive computational techniques but are not general purpose: they are not programmed for arbitrary things, only for the range of graphics operations or modem operations they were designed for.

And then, of course, some combination of these: can we come up with domain-specific languages that match these new domain-specific architectures, improve efficiency, and let us code a range of applications very effectively?

This is a fascinating slide from a paper by Charles Leiserson and his colleagues at MIT, published in Science, called "There's Plenty of Room at the Top."

What they observed is that software inefficiency, and the mismatch between software and hardware, leaves lots of opportunity to improve performance. They took an admittedly very simple program — matrix multiply — written initially in Python, and ran it on an 18-core Intel processor. Simply rewriting the code from Python to C gave a factor of 47 improvement. Introducing parallel loops gave another factor of approximately eight.

Then they introduced memory optimizations: if you're familiar with large-scale matrix multiply, doing it in a blocked fashion dramatically improves cache effectiveness, and that gave them another factor of a bit under 20 — about 15. Finally, using the vector instructions inside the Intel processor gained another factor of 10. Overall, the final program runs more than 62,000 times faster than the initial Python program.

Now, this is not to say you would get this for larger-scale programs or in all kinds of environments, but it's an example of how much inefficiency there is in at least one simple application. Of course, not many performance-sensitive things are written in Python, but even the improvement from plain C to the fully parallel version of C using SIMD instructions — nearly a factor of 100, more than 100, almost 150 — is similar to what you would get from a domain-specific processor. That is significant in its own right.
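(Editor's note: the flavor of this gap is easy to reproduce. Below is a minimal sketch in Python — an interpreted triple loop versus NumPy's call into an optimized, vectorized BLAS library, with the library standing in for the hand-tuned parallel, blocked, SIMD C version in the MIT study:

    import time
    import numpy as np

    n = 256
    A, B = np.random.rand(n, n), np.random.rand(n, n)

    def naive_matmul(A, B):
        # the "initial Python" rung of the ladder: an interpreted triple loop
        n = A.shape[0]
        C = [[0.0] * n for _ in range(n)]
        for i in range(n):
            for j in range(n):
                s = 0.0
                for k in range(n):
                    s += A[i, k] * B[k, j]
                C[i][j] = s
        return C

    t0 = time.perf_counter(); naive_matmul(A, B); t1 = time.perf_counter()
    t2 = time.perf_counter(); A @ B; t3 = time.perf_counter()
    print(f"loops: {t1 - t0:.2f}s  BLAS: {t3 - t2:.5f}s  "
          f"ratio: {(t1 - t0) / (t3 - t2):,.0f}x")

On a typical machine the ratio runs to four or more orders of magnitude; the exact number is machine-dependent, but it is the same territory as the paper's 62,000×.)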

So there are lots of opportunities here, and that's the key observation behind this slide.

So what are these domain-specific architectures? They achieve higher efficiency by tailoring the architecture to the characteristics of a domain.

We're not trying to run just one application; we're trying to cover a domain of applications — deep learning, for example, or computer graphics, or virtual reality. So it's different from a strict ASIC, which is designed for only one function, like a modem.

It requires more domain-specific knowledge. We need a language that conveys important properties of the application that are hard to deduce if we start from a low-level language like C. This is a product of codesign: we design the application and the domain-specific processor together, and that is critical to getting them to work well with each other.

Notice that these are not machines on which we run general-purpose applications. The intention is not to take arbitrary C code and run it; the intention is to take an application designed to run on that particular DSA, and to use a domain-specific language to convey to the processor the information it needs to get significant performance improvements.

The key goal is higher efficiency in the use of both power and transistors. Remember, those are the two limiters: the slowing growth in transistors, and the power problem from the end of Dennard scaling. So we are trying to really improve efficiency on both fronts.

The good news here is that deep learning is a broadly applicable technology. It's a new programming model — programming with data rather than writing massive amounts of highly specialized code: you use data to train a deep learning model to detect the specialized circumstances in the data.

And so we have a good target domain: applications that genuinely demand massive performance increases, and for which we think appropriate domain-specific architectures exist.

It's important to understand why these domain-specific architectures can win — in particular, there is no magic here.

People who are familiar with the books Dave Patterson and I co-authored know that we believe in quantitative analysis and an engineering-science approach to designing computers. So what makes these domain-specific architectures more efficient?

First, they use a simple model of parallelism that works across a specific domain, which means they can have less control hardware. For example, we switch from the multiple-instruction, multiple-data (MIMD) model of a multicore to a single-instruction, multiple-data (SIMD) model. That dramatically improves the energy spent fetching instructions, because we now fetch one instruction rather than n independent instructions.

We move to VLIW rather than speculative, out-of-order mechanisms: approaches that rely on the compiler being able to analyze the code and know the dependences, and therefore to create and structure the parallelism at compile time rather than dynamically at runtime.

Second, we make more effective use of memory bandwidth. We go to user-controlled memory systems rather than caches. Caches are great — except when large amounts of data stream through them; then they're extremely inefficient, because that's not what they were meant to do. Caches are meant to work when the program does repetitive things in a somewhat unpredictable fashion. Here we have repetitive things in a very predictable fashion, but with very large amounts of data.

So we use an alternative: prefetching and other techniques to move data into the memory of the domain-specific processor, where we can make heavy use of it before moving it back to main memory.

We eliminate unneeded accuracy. It turns out we need much less accuracy than for general-purpose computing: 8- to 16-bit integers, and 16- to 32-bit floating point rather than 64-bit. So we gain efficiency by making the data items smaller and the arithmetic operations cheaper.
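(Editor's note: a minimal sketch of the reduced-precision idea — quantizing hypothetical fp32 weights to int8 with a single scale factor, one simple scheme among many:

    import numpy as np

    w = np.random.randn(1024).astype(np.float32)  # hypothetical fp32 weights

    # map the fp32 range onto signed 8-bit integers with one scale factor
    scale = np.abs(w).max() / 127.0
    w_int8 = np.round(w / scale).clip(-127, 127).astype(np.int8)
    w_restored = w_int8.astype(np.float32) * scale

    print(w.nbytes, "->", w_int8.nbytes, "bytes")             # 4096 -> 1024
    print("max error:", float(np.abs(w - w_restored).max()))  # about scale/2

Each weight takes a quarter of the memory and memory traffic, and an 8-bit multiply-accumulate unit is far smaller and cheaper in energy than a 64-bit floating-point unit — which is where the efficiency described above comes from.)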

The key is that the domain-specific programming model matches the application to the processor. These are not general-purpose processors: you are not going to take a piece of C code, throw it on one of these, and be happy with the results. They are designed to match a particular class of applications, and that structure is determined by the interface of the domain-specific language and the underlying architecture.

This slide shows an example, to give you an idea of how differently we use silicon in these environments compared with a traditional processor.

What I've done here is take the first-generation TPU — Google's first tensor processing unit — though I could take the second, third, or fourth and the numbers would be very similar. This block diagram shows what the chip area is devoted to. There's a very large matrix-multiply unit that can do a 256 × 256 multiply of 8-bit values (the later generations have floating-point versions of that multiply). There's a unified buffer used for local activations, plus interfaces, accumulators, a little bit of control, and interfaces to DRAM.
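(Editor's note: the scale of that unit is worth spelling out. A 256 × 256 array of 8-bit multiply-accumulators performs 256 × 256 = 65,536 multiply-adds — on the order of 131,000 arithmetic operations — every cycle. At the first TPU's roughly 700 MHz clock, that is on the order of 92 trillion operations per second from one block of silicon, which is what the die-area bet on compute buys.)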

Today that memory would be high-bandwidth DRAM; early on it was DDR3. If you look at how the area is used: 44% goes to memory, to store temporary results, weights, and values being computed; almost 40% to compute; 15% to the interfaces; and 2% to control.

Compare that to a single Skylake core from an Intel processor. There, 33% of the area is used for cache. So notice that we have more memory capacity on the TPU than on the Skylake core — and in fact, if you subtract the cache tags, which are overhead rather than real data, the Skylake number probably drops to about 30%, so the TPU devotes almost 50% more area to active data.

30% of the Skylake area goes to control. That's because the core is an out-of-order, dynamically scheduled processor, like most modern general-purpose processors, and that requires significantly more area for control — roughly 15 times more. That control is overhead, and unfortunately the control unit is energy-intensive, so it's a big power consumer as well. 21% goes to compute.

So notice the big advantage here: the TPU's compute area is almost double that of the Skylake core. Add memory-management overhead and miscellaneous overhead, and the Skylake core ends up using a lot more area for control, a lot less for compute, and somewhat less for memory.

So where does this bring us? We've come to an interesting time in the computing industry, and I want to conclude by reflecting on it and saying something about how things are likely to go forward, because I think we're at a real turning point in the history of computing.

From the 1960s, with the introduction of the first real commercial computers, until about 1980, we had largely vertically integrated companies.

IBM, Burroughs, and Honeywell were early spin-outs of the activity at the University of Pennsylvania that built ENIAC, the first electronic computer.

IBM is the perfect example of a vertically integrated company in that period. They did everything: they built their own chips, they built their own disks — in fact, IBM's West Coast operation here in California was originally opened to do disk technology, and the first Winchester disks were built on the West Coast.

They built their own processors — the 360 and 370 series, and so on. On top of that they built their own operating systems and compilers; they even built their own database systems and their own networking software; in some cases they even built the application programs. Certainly the core of the system, from the fundamental hardware up through the databases, OS, and compilers, was all built by IBM. The driver here was concentration of technical expertise: IBM could put together expertise across this wide set of areas, assemble a world-class team, and optimize across the stack in a way that enabled their operating system to do things such as virtual memory long before other commercial efforts could.

And then the world changed — really changed — with the introduction of the personal computer and the takeoff of the microprocessor.

We moved from a vertically organized industry to a horizontally organized one. We had silicon manufacturers: Intel doing processors, along with AMD and, initially, several other companies such as Fairchild and Motorola. We had a company like TSMC arise as a silicon foundry, making silicon for others — something that didn't exist earlier but really took off in the late '80s and '90s, and that enabled other people to build chips for graphics or other functions outside the processor.

But Intel didn't do everything. Intel did the processors; Microsoft came along and did the OS and compilers on top of that; companies like Oracle came along and built their databases and other applications on top of that. So the industry became very horizontally organized. The key driver behind this was obviously the introduction of the personal computer.

The rise of shrink-wrap software — something a lot of us did not see coming — also became a crucial driver. It meant the number of architectures that could easily be supported had to stay fairly small, because a company selling shrink-wrap software did not want to port to, and verify its software on, lots of different architectures.

And of course there was the dramatic growth of the general-purpose microprocessor. This is the period in which the microprocessor replaced all other technologies, including the largest supercomputers — and it happened much faster than we expected. By the mid-'80s the microprocessor had put a serious dent in the minicomputer business; by the early '90s the mainframe business was struggling; and from the mid-'90s into the 2000s it took a real bite out of the supercomputer industry. Even supercomputers converted from customized special architectures to arrays of general-purpose microprocessors: they were simply far too efficient, in cost and performance, to be ignored.

Now, all of a sudden, we're in a new era. Not because general-purpose processors will completely go away — they will remain important — but they will be less central, and domain-specific processors, moving quickly together with software, will begin to play the key role in the fastest, most important applications. So rather than a purely horizontal structure, we will again see more vertical integration: between the people who have the deep learning and machine learning models and the people who build the operating systems and compilers that let those models run efficiently, train efficiently, and be deployed in the field.

Inference is a critical part of this: when we deploy these systems in the field, we will probably have lots of very specialized processors that each handle one particular problem. The processor that sits in a security camera, for example, will have a very limited use; the key will be optimizing for power and efficiency in that use — and for cost, of course. So we see a different kind of integration, and Microsoft, Google, and Apple are all looking at this.

The Apple M1 is a perfect example. It's a processor designed by Apple with a deep understanding of the applications likely to run on it. There's a special-purpose graphics processor, a special-purpose machine-learning accelerator, and multiple cores — and even the cores are not completely homogeneous: some are slow, low-power cores, and some are high-speed, high-performance, higher-power cores. It's a completely different design approach, with much more codesign and vertical integration.

We're optimizing in a different way than we did in the past, and I think this will slowly but surely change the entire computer industry. It's not that the general-purpose processor will go away, or that the companies making software that runs on many machines will disappear, but the industry will have a whole new driver, created by the dramatic breakthroughs we've seen in deep learning and machine learning. I think this is going to make for a really interesting next 20 years.

Thank you for your kind attention, and I'd like to wish the 2021 T-EDGE conference a great success. Thank you.

(This article was first published on the TMTPost App.)
