By 苏剑林 | November 06, 2019
Over the past two weeks, I have put a lot of energy into developing bert4keras. Apart from some API standardization work, the main effort went into the pre-training part of the code. As of yesterday, the pre-training code is essentially complete and has been tested successfully in both TPU and multi-GPU environments. This gives readers who are motivated (and have the computing power) to improve pre-trained models another option. It may well be the clearest and easiest-to-understand implementation of BERT and its pre-training currently available.
Pre-training code link: https://github.com/bojone/bert4keras/tree/master/pretraining
After these two weeks of development (and hole-filling), my biggest takeaway is that Keras has truly become the gold standard for TensorFlow. As long as your code is written according to Keras conventions, it can be migrated to tf.keras with little effort and then trained on TPUs or multiple GPUs very easily, which is close to a once-and-for-all solution. Conversely, if your coding style is too freewheeling, including many of the "grafting"-style Keras tricks I have introduced before, you may run into quite a few problems. You may even find that, although you can get the code running on multiple GPUs, you simply cannot get it to work on a TPU no matter what you try.
Support Without Reservation
Everyone says that TensorFlow 2.0 promotes tf.keras as its main high-level API, but in fact tf.keras has been the gold standard since TensorFlow 1.14. So if you want to experience how Google supports Keras without reservation, TensorFlow 1.14+ is all you need; you do not have to upgrade to 2.0. At present, the bert4keras code supports both the original Keras and tf.keras. For ordinary single-card fine-tuning tasks you can use either, but if you want multi-GPU or even TPU training, tf.keras is the best choice.
To get started with tf.keras, I first recommend referring to a very good website: https://tf.wiki/
In tf.keras, converting a model from single-card to multi-card training is very simple:
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
    model = create_a_model()  # your own model-building function
    model.compile(loss='mse', optimizer='adam')

model.fit(train_x, train_y, epochs=10)
In other words, you only need to define a strategy, and then models built within the scope of this strategy become multi-card models. Multi-card training has never been this simple.
By the way, Keras itself ships with a multi_gpu_model function for multi-GPU training, but in my own tests multi_gpu_model does not work very well and sometimes simply has no effect. In short, I still recommend tf.keras. Also, the example above is for a single machine with multiple cards; multi-machine multi-card training is similar, but I do not have an environment to test it, so I have not provided an example. If you want to try it, please refer to the introduction at https://tf.wiki/.
What about TPUs? Just as simple: you only need to replace the strategy (note that the API differs slightly in TensorFlow 2.0):
# Connect to the TPU, initialize it, and build a TPU distribution strategy
# (tpu_address is the gRPC address of your TPU)
resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu=tpu_address)
tf.config.experimental_connect_to_host(resolver.master())
tf.tpu.experimental.initialize_tpu_system(resolver)
strategy = tf.distribute.experimental.TPUStrategy(resolver)
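Then, exactly as in the multi-GPU example, you build and compile the model inside strategy.scope() and call fit as usual. A minimal sketch, reusing the hypothetical create_a_model, train_x and train_y from above:

with strategy.scope():            # strategy is now the TPUStrategy built above
    model = create_a_model()      # same hypothetical model-building function as before
    model.compile(loss='mse', optimizer='adam')

model.fit(train_x, train_y, epochs=10)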
Doesn't it feel incredibly simple? By now everyone should understand what I meant by supporting Keras "without reservation": starting from TensorFlow 1.14, as long as you stick to the standard Keras style, almost everything just works.
What is Considered "Standard"?
I have repeatedly emphasized using the standard Keras style. So what counts as standard? Here is a summary of my experience.
1. Use Keras's built-in layers, loss functions, and optimizers as much as possible to implement the functionality you need. If a model is built entirely from Keras's built-in components, it is essentially guaranteed to run on multiple GPUs or on a TPU.
2. If you want to write custom layers, follow the Keras conventions strictly, in particular implement the get_config method properly. One way to test whether your layer is written in a standard way is to build a model with it and check whether the clone_model function can successfully clone that model; if it can, your layer definition is standard (a minimal sketch is given after this list).
3. If you want to train on TPUs, do not use add_loss at the end of the model to define custom losses, and do not use add_metric to add metrics. If you need complex losses or metrics, define them as the output of a layer instead (a sketch of this pattern also appears after this list).
4. If you want to train on TPUs, avoid dynamic (variable-length) logic during training. For example, when calling tf.where, the x and y arguments must not be None, otherwise the length of tf.where's result is not known statically. Likewise, almost none of the TensorFlow functions with "dynamic" in their names can be used (see the short snippet after this list).
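To make point 2 concrete, here is a minimal sketch of a custom layer that follows the conventions, together with the clone_model test. None of this is taken from bert4keras; the layer, its name and its argument are all hypothetical.

import tensorflow as tf
from tensorflow import keras

class ScaleLayer(keras.layers.Layer):
    """A hypothetical custom layer that multiplies its input by a trainable scalar."""
    def __init__(self, init_scale=1.0, **kwargs):
        super(ScaleLayer, self).__init__(**kwargs)
        self.init_scale = init_scale

    def build(self, input_shape):
        self.scale = self.add_weight(name='scale', shape=(),
                                     initializer=keras.initializers.Constant(self.init_scale))
        super(ScaleLayer, self).build(input_shape)

    def call(self, inputs):
        return inputs * self.scale

    def get_config(self):
        # Every constructor argument must be serialized here; otherwise
        # clone_model (and model saving) cannot rebuild the layer.
        config = {'init_scale': self.init_scale}
        base_config = super(ScaleLayer, self).get_config()
        return dict(list(base_config.items()) + list(config.items()))

# The test from point 2: if clone_model succeeds, the layer is written in a standard way.
x_in = keras.layers.Input(shape=(16,))
x_out = ScaleLayer(init_scale=2.0)(x_in)
model = keras.models.Model(x_in, x_out)
cloned = keras.models.clone_model(model)

For point 3, the "loss as a layer output" pattern looks roughly like the following (again with hypothetical names, sharing the imports above): the labels enter the model as an extra input, and the model's output is the loss itself.

num_classes = 10  # hypothetical

x_in = keras.layers.Input(shape=(16,))
y_in = keras.layers.Input(shape=(1,))  # labels are fed in as a model input
y_pred = keras.layers.Dense(num_classes, activation='softmax')(x_in)

class LossLayer(keras.layers.Layer):
    """Computes the per-sample training loss as an ordinary layer output."""
    def call(self, inputs):
        y_true, y_pred = inputs
        return keras.backend.sparse_categorical_crossentropy(y_true, y_pred)

loss = LossLayer()([y_in, y_pred])
train_model = keras.models.Model([x_in, y_in], loss)
# The model's output is already the loss, so the compiled "loss" just passes it
# through; fit() is then called with dummy targets (e.g. an array of zeros).
train_model.compile(optimizer='adam', loss=lambda y_true, y_pred: y_pred)

And the tf.where example from point 4, in code form:

x = tf.random.normal([4, 8])             # any tensor
mask = tf.greater(x, 0.)
y = tf.where(mask, x, tf.zeros_like(x))  # same static shape as x: fine on TPUs
idx = tf.where(mask)                     # variable-length list of indices: avoid on TPUs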
As you can see, the so-called "standard" really means imitating existing Keras styles as closely as possible and avoiding self-invented tricks. As long as you follow points 1 and 2, you can easily use multi-GPU training under tf.keras; points 3 and 4 are specifically about avoiding pitfalls on TPUs. In general, everything must be static.
Although TensorFlow 2.0 has defaulted to eager execution (dynamic graphs), I do not recommend using it. Personally, I believe we should get used to the model construction workflow of static graphs. While dynamic graphs are convenient for debugging, they make us heavily dependent on immediate output results, reducing our ability to debug complex problems. Similarly, I do not recommend using tools like code completion or code hints; these tools make us too reliant and prevent us from truly understanding the functions themselves. (Personal opinion, please don't flame.)
The Victory of User-Friendliness
If I remember correctly, I first came into contact with Keras in early 2015. At that time, there weren't many deep learning frameworks. I just wanted to find a handy tool to implement a few simple models, so I found Keras and have been using it ever since. Not just me, but perhaps even the authors of Keras did not expect back then that Keras would become the gold standard for TensorFlow today.
I feel this is no accident. TensorFlow has had many high-level API frameworks, such as TensorLayer, tf.slim, and TFLearn, so why was Keras the one eventually chosen? Beyond Keras's "long history", the reason is that Keras is a genuinely well-designed and elegant piece of encapsulation. Over the past year or so I have occasionally read the Keras source code, and every time I am struck by its rigor and elegance. It is a work of art in its own right, a humane and user-friendly creation.
Therefore, this is the victory of user-friendliness.